Abstract

This  report  analyzes  the  complexity  of  on-line  reinforcement  learning  algorithms, 
namely  asynchronous  real-time  versions  of  Q-learning  and  value-iteration,  applied  to 
the  problems  of  reaching  any  goal  state  from  the  given  start  state  and  finding  shortest 
paths  from  all  states  to  a  goal  state.  Previous  work  had  concluded  that,  in  many  cases, 
initially  uninformed  (i.e.  tabula  rasa)  reinforcement  learning  was  exponential  for  such 
problems,  or  that  it  was  tractable  (i.e.  of  polynomial  time-complexity)  only  if  the 
learning  algorithm  was  augmented.  We  prove  that,  to  the  contrary,  the  algorithms  are 
tractable  with  only  a  simple  change  in  the  task  representation  (“penalizing  the  agent 
for  action  executions”)  or  initialization  (“initializing  high”).  We  provide  tight  bounds 
on  their  worst-case  complexity,  and  show  how  the  complexity  is  even  smaller  if  the 
state  space  has  certain  special  properties.  We  compare  these  reinforcement  learning 
algorithms  to  other  uninformed  on-line  search  methods  and  to  informed  off-line  search 
methods,  and  investigate  how  initial  knowledge  of  the  topology  of  the  state  space  can 
decrease their complexity. We also present two novel algorithms, the bi-directional
Q-learning algorithm and the bi-directional value-iteration algorithm, for finding shortest
paths from all states to a goal state, and show that they are no more complex than their
counterparts for reaching a goal state from a given start state. The worst-case analysis
of the reinforcement learning algorithms is complemented by an empirical study of
their average-case complexity in three domains.


1  Introduction 


Consider  the  problem  for  an  agent  of  finding  its  way  to  one  of  a  set  of  goal  locations, 
where  actions  consist  of  moving  from  one  intersection  (state)  to  another  (on  a  map,  on 
a  directed  graph,  or  in  a  state  space,  see  Figure  1).  Initially,  the  agent  has  no  knowledge 
of  the  topology  of  the  state  space.  We  consider  two  different  tasks:  reaching  any  goal 
state  and  finding  shortest  paths  from  every  state  to  a  goal  state.  [27]  call  the  former 
task  satisficing  search  (search  for  “all-or-none  solutions”),  whereas  the  latter  one 
corresponds  to  shortest-path  search  for  all  states. 

Off-line  search  methods,  which  first  derive  a  plan  that  is  then  executed,  cannot  be 
used  to  solve  the  path  planning  tasks,  since  the  topology  of  the  state  space  is  initially 
unknown  to  the  agent  and  can  only  be  discovered  by  exploring:  i.e.  executing  actions 
and observing their effects. Thus, the path planning tasks have to be solved on-line.
On-line search methods, also called incremental or real-time search methods
[17],  interleave  search  with  action  execution,  see  Figure  2.  The  algorithms  that  we 
describe  here  perform  only  minimal  computation  between  action  executions,  choosing 
only  which  action  to  execute  next,  and  basing  this  decision  only  on  information  local 
to  the  current  state1  of  the  agent  (and  perhaps  its  immediate  successor  states).  This 
way,  the  time  between  action  executions  is  linear  in  the  number  of  actions  available  in 
the  current  state.  If  this  number  does  not  depend  on  the  size  of  the  state  space,  then 
neither  does  the  search  time  between  action  executions.  Such  methods  have  recently 
been  proven  to  be  very  powerful  if  executed  by  multiple  agents  [12]. 

There  is  a  potential  trade-off  between  exploitation  and  exploration  when  selecting  an 
action,  see  [33].  Exploitation  means  to  behave  optimally  according  to  the  current 
knowledge, whereas exploration means to acquire new knowledge. Exploration consumes
time, but may subsequently allow the agent to solve the task faster. Exploration
is necessary for the path planning tasks, since the agent initially has no knowledge of
the topology of the state space. A standard strategy to deal with the
exploitation-exploration trade-off is to exploit most of the time and to explore only from time to time.
For  exploration,  the  agent  has  to  overcome  the  limited  experimentation  problem, 
since  it  is  constrained  to  execute  only  one  action  of  its  choice  in  the  current  state,  which 
then  uniquely  determines  the  new  state  of  the  agent.2  Furthermore,  it  might  not  be 
able  to  reverse  the  effect  of  an  action  immediately,  i.e.  to  backtrack  to  its  former  state, 
since  there  might  not  be  an  action  that  leads  from  its  new  state  back  to  its  former 
state. 

We  will  investigate  a  class  of  search  algorithms  which  perform  reinforcement  learning. 
The  application  of  reinforcement  learning  to  on-line  path  planning  problems  has  been 

1When  we  talk  about  “the  current  state  of  the  agent,”  we  always  refer  to  a  state  of  the  state  space 
and  not  to  the  knowledge  of  the  agent,  i.e.  not  to  the  values  of  the  variables  of  the  algorithm  that 
controls  the  agent. 

2Other researchers, for example [30], [38], [24], and [19], have proposed planning schemes for the
case that arbitrary (not just local) actions can be executed at any time (in a mental model of the
state space, i.e. as part of a “Gedankenexperiment”). Our assumptions are more restrictive.

Figure 1: Navigating on a map


studied  by  [2],  [5],  [23],  [19],  [24],  and  others.  [35]  showed  that  reaching  a  goal  state  with 
uninformed  (i.e.  tabula  rasa)  reinforcement  learning  methods  can  require  a  number  of 
action  executions  that  is  exponential  in  the  size  of  the  state  space.  [33]  has  shown  that 
by  augmenting  reinforcement  learning  algorithms,  the  problem  can  be  made  tractable. 
We  address  the  question  raised  by  these  results  whether  one  has  to  augment  standard 
reinforcement  learning  algorithms  to  make  them  tractable.  We  will  show  that,  contrary 
to  prior  belief,  reinforcement  learning  algorithms  are  tractable  without  any  need  for 
augmentation,  i.e.  their  run-time  is  a  small  polynomial  in  the  size  of  the  state  space. 
All  that  is  necessary  is  a  change  in  the  way  the  state  space  (task)  is  represented. 

In this report, we use the following notation.3 $S$ denotes the finite set of states of the
state space, and $G$ (with $\emptyset \neq G \subseteq S$) is the non-empty set of goal states. $s_{start} \in S$ is
the start state of the agent. A domain is a state space together with a start state and
a set of goal states. $A(s)$ is the finite set of actions that can be executed in $s \in S$. The
size of the state space is $n := |S|$, and the total number of actions is $e := \sum_{s \in S} |A(s)|$
(i.e. an action that is applicable in more than one state counts more than once).
Executing an action causes a deterministic state transition that depends only on the
action and the state it is executed in (Markov assumption). $succ(s,a)$ is the uniquely
determined successor state when $a \in A(s)$ is executed in $s \in S$. The distance $d(s,s')$
between $s \in S$ and $s' \in S$ is defined to be the (unique) solution of the following set of

3We use the following sets of numbers: $\mathbb{R}$ is the set of reals, $\mathbb{N}_0$ is the set of non-negative integers,
$\mathbb{N}^-$ is the set of negative integers, and $\mathbb{N}_0^-$ is the set of non-positive integers.

Figure 2: Off-line versus on-line search (panel (a): off-line search)


equations 


$$d(s,s') = \begin{cases} 0 & \text{if } s = s' \\ 1 + \min_{a \in A(s)} d(succ(s,a), s') & \text{otherwise} \end{cases} \qquad \text{for all } s, s' \in S$$


i.e. it is the smallest number of action executions required to reach state $s'$ from
state $s$. The goal distance $gd(s)$ of $s \in S$ is defined to be $gd(s) := \min_{s' \in G} d(s,s')$.
Similarly, the depth of the state space $d$ is defined to be $d := \max_{s,s' \in S} d(s,s')$, i.e.
the maximum of the distances between two arbitrary states. We assume that the state
space is strongly connected (or, synonymously, irreducible), i.e. every state can be
reached from every other state (formally: $d(s,s') < \infty$ for all $s,s' \in S$). Since the
state space is strongly connected, it holds that $d \leq n - 1$ and $n \leq e$. We also assume
that  the  state  space  is  totally  observable,  i.e.  the  agent  can  determine  its  current  state 
with  certainty.4  It  also  knows  at  every  point  in  time  which  actions  it  can  execute  in  its 
current  state  and  whether  its  current  state  is  a  goal  state.  Furthermore,  the  domain  is 
single  agent  and  stationary,  i.e.  does  not  change  over  time. 
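
The definitions of $d(s,s')$, $gd(s)$, and the depth $d$ can be checked on a small example. The following is a minimal sketch, assuming the state space is stored as a dictionary `succ` that maps every state to its actions and their successor states. The four-state domain and all identifiers are illustrative assumptions and are not taken from the report.

```python
from collections import deque

# Hypothetical 4-state domain: succ[s][a] is the successor of executing action a in state s.
succ = {
    0: {"right": 1},
    1: {"right": 2, "back": 0},
    2: {"right": 3, "back": 0},
    3: {"back": 0},
}
goals = {3}


def distances_from(s):
    """Breadth-first search: d(s, s') for every state s' reachable from s."""
    dist = {s: 0}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        for v in succ[u].values():
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def goal_distance(s):
    """gd(s): distance from s to the nearest goal state."""
    dist = distances_from(s)
    return min(dist[g] for g in goals)


def depth():
    """d: the largest distance between any two states."""
    return max(max(distances_from(s).values()) for s in succ)


n = len(succ)                                        # size of the state space
e = sum(len(actions) for actions in succ.values())   # total number of actions

print([goal_distance(s) for s in succ], depth(), n, e)
```

In this example the goal distances of states 0 to 3 are 3, 2, 1, and 0, and the depth equals $n - 1 = 3$, so the bound on the depth is attained.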

Formally, the results of the report are as follows. If a good task representation
(“penalizing the agent for action executions”) or suitable initialization (“initializing high”)
is chosen, the worst-case complexity of reaching a goal state has a tight bound of
$O(n^3)$ action executions for Q-learning (provided that the state space has no duplicate
actions) and $O(n^2)$ action executions for value-iteration. If the agent has initial knowledge
of the topology of the state space or the state space has additional properties,
these bounds can be decreased further. In addition, we show that reinforcement learning
methods for finding shortest paths from every state to a goal state are no more
complex than reinforcement learning methods that simply reach a goal state from the
start state. (All proofs can easily be derived by induction and are stated in the
appendix.) This demonstrates that one does not need to augment reinforcement learning
algorithms to make them tractable.

4[21] state results about the worst-case complexity of every algorithm for cases where the states
are partially observable or hidden, i.e. not observable at all.


2  Reinforcement  Learning 

Reinforcement learning is learning from positive and negative rewards. (Negative
rewards are often called penalties or costs.) Every action5 $a \in A(s)$ has an immediate
reward $r(s,a) \in \mathbb{R}$ that is obtained when the agent executes the action. If the agent
starts in $s_{start} = s \in S$ and executes actions for which it receives immediate reward $r_t$
at step (number of actions executed previously) $t \in \mathbb{N}_0$, then the total reward that the
agent receives over its lifetime for this particular behavior is

$$U(s) := \sum_{t=0}^{\infty} \gamma^t r_t \qquad (1)$$

where $\gamma \in (0,1]$ is called the discount factor. We say that discounting is used if $\gamma < 1$,
otherwise no discounting is used.
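
For a lifetime of finite length, the total reward of formula 1 can be evaluated directly. The helper below is a small sketch under that assumption; the function name and the sample reward sequences are illustrative only.

```python
def total_reward(rewards, gamma=1.0):
    """Total reward of formula 1 for a finite sequence r_0, r_1, ..., r_T.

    gamma is the discount factor; gamma = 1.0 means that no discounting is used.
    """
    return sum(gamma ** t * r for t, r in enumerate(rewards))


# Three penalized action executions, without and with discounting.
print(total_reward([-1, -1, -1]))
print(total_reward([-1, -1, -1], gamma=0.5))
# A single reward of 1 that is received at step t = 2.
print(total_reward([0, 0, 1], gamma=0.5))
```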

Reinforcement  learning  algorithms  find  a  behavior  for  the  agent  that  maximizes  the 
total  reward  for  every  possible  start  state.6  Such  a  behavior  is  usually  specified  as  a 
stationary,  deterministic  policy,  in  the  following  called  policy.  A  policy,  also  called 
action  map  or  state-action  rules, 

$$f : S \rightarrow \bigcup_{s \in S} A(s)$$

where $f(s) \in A(s)$ for all $s \in S$, is an assignment of actions to states that determines
which action $f(s) \in A(s)$ the agent has to execute in its current state $s \in S$. Executing
a policy makes the agent highly reactive. By using closed loop plans instead
of open loop plans, the agent is able to deal with uncertain action outcomes
(“contingencies”) and thus overcomes some of the deficiencies caused by more traditional
planning approaches. Although the notion “policy” originated in the field of Stochastic
Dynamic Programming, similar schemes have been proposed in the context of Artificial
Intelligence, for example Schoppers’ universal plans [26].

Finding an optimal policy solves the credit-assignment problem, i.e. which
action(s) of an action sequence to blame if the total reward is non-optimal. Usually it is
not  sufficient  to  always  execute  the  action  with  the  largest  immediate  reward,  because 
executing  actions  with  small  immediate  rewards  can  be  necessary  to  make  large  future 
rewards  possible.  This  is  called  the  problem  of  delayed  rewards  or,  alternatively, 
reinforcement  learning  with  delayed  rewards. 

5To  describe  reinforcement  learning,  we  use  the  symbols  of  the  path  planning  domain,  but  overline 
them. 

6Depending  on  the  reinforcement  learning  problem,  the  total  reward  can  become  (plus  or  minus) 
infinity  if  no  discounting  is  used,  and  might  then  not  be  useful  to  discriminate  between  good  and  bad 
behavior  of  the  agent.  In  this  case,  one  can  use  either  the  total  discounted  reward  or  the  average 
reward  per  step  in  the  limit. 
Reinforcement learning algorithms proceed in two steps to find an optimal policy:

1. First, they determine the values $U_{opt}(s)$ for all $s \in S$. $U_{opt}(s)$ is the largest total
reward possible for $s_{start} = s$. These values are the solutions of the following set
of equations

$$U_{opt}(s) = \max_{a \in A(s)} \left( r(s,a) + \gamma U_{opt}(succ(s,a)) \right) \qquad \text{for all } s \in S \qquad (2)$$

If $\gamma < 1$, then the solution is unique. The solution might not be unique for $\gamma = 1$,
but the differences $U_{opt}(s) - U_{opt}(s')$ for all $s,s' \in S$ are the same for every solution [8].
Thus, any solution determines the same preferences among the states.

2. Then, an optimal policy is to execute action $\arg\max_{a \in A(s)} (r(s,a) + \gamma U_{opt}(succ(s,a)))$
in state $s \in S$.7
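
The two steps can be sketched off-line on a small domain: repeated synchronous sweeps solve the set of equations 2, and a greedy choice then yields a policy. This is only an illustration. The domain, the reward of $-1$ per action, the discount factor of 0.5, and all identifiers are assumptions chosen so that the numbers are easy to follow.

```python
GAMMA = 0.5  # assumed discount factor, chosen only for illustration

# Hypothetical domain: succ[s][a] is the successor state of action a in state s.
# State 3 is the goal; its only action "id" leaves the state unchanged.
succ = {
    0: {"right": 1},
    1: {"right": 2, "back": 0},
    2: {"right": 3, "back": 0},
    3: {"id": 3},
}
# Assumed rewards: -1 for every action in a non-goal state, 0 for "id".
reward = {s: {a: (0.0 if a == "id" else -1.0) for a in succ[s]} for s in succ}


def lookahead(U, s, a):
    """r(s, a) + gamma * U(succ(s, a))."""
    return reward[s][a] + GAMMA * U[succ[s][a]]


def solve_optimal_values(tol=1e-12):
    """Step 1: sweep over all states until the values satisfy equations 2."""
    U = {s: 0.0 for s in succ}
    while True:
        new_U = {s: max(lookahead(U, s, a) for a in succ[s]) for s in succ}
        if max(abs(new_U[s] - U[s]) for s in succ) < tol:
            return new_U
        U = new_U


def greedy_policy(U):
    """Step 2: in every state, pick an action that maximizes the look-ahead value."""
    return {s: max(succ[s], key=lambda a: lookahead(U, s, a)) for s in succ}


U_opt = solve_optimal_values()
policy = greedy_policy(U_opt)
print(U_opt)
print(policy)
```

Note that this sketch updates all states in every sweep and therefore needs the whole state space in advance, which is exactly what the on-line algorithms discussed in the following cannot assume.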

We  analyze  two  reinforcement  learning  algorithms  that  are  widely  used:  Q-learning 
[34]  and  value-iteration  [4].  Both  can  be  used  off-line,  and  especially  value-iteration 
has  traditionally  been  used  by  updating  all  states  either  synchronously,  for  example 
simultaneously  or  by  performing  a  sweep  over  the  state  space  and  updating  every 
state  in  turn  (“Gauss-Seidel  updating”),  or  asynchronously.  Since  both  Q-learning 
and  value-iteration  are  temporal  difference  methods  and  therefore  incremental,  one 
can  interleave  them  with  action  execution  to  construct  asynchronous  real-time  forms 
that  use  actual  state  transitions.  In  the  following,  we  investigate  these  on-line  versions: 
1-step  Q-learning  and  1-step  value-iteration. 

2.1 Q-Learning

The 1-step Q-learning algorithm8 [35], in the following called Q-learning, consists
of  a  termination  checking  step  (line  2),  an  action  selection  step  (line  3),  an  action 
execution  step  (line  4),  and  a  value  update  step  (line  5),  see  Figure  3.  For  now,  we 
leave  the  initial  Q-values  unspecified. 

The  action  selection  step  implements  the  exploration  rule  (“which  state  to  go  to  next”). 
It  is  constrained  to  look  only  at  information  local  to  the  current  state  s  of  the  agent. 
Q-learning  does  not  learn  or  use  an  action  model,  i.e.  the  action  selection  step  does 
not need to predict $succ(s,a)$ for an $a \in A(s)$. Instead, information about the relative
goodness of the actions is stored in the states. This includes a value $Q(s,a)$ in state
$s$ for each action $a \in A(s)$. $Q(s,a)$ approximates the optimal total reward received if
the agent starts in $s$, executes $a$, and then behaves optimally.

7read: “any action $a \in A(s)$ for which $r(s,a) + \gamma U_{opt}(succ(s,a)) = \max_{a' \in A(s)} (r(s,a') + \gamma U_{opt}(succ(s,a')))$”.
If several actions tie, an arbitrary one of the equally good actions can be selected.

8Since  we  consider  only  actions  with  deterministic  outcomes,  we  state  the  Q-learning  algorithm 
with  the  learning  rate  a  set  to  one. 
1. Set $s$ := the current state.

2. If $s \in G$, then stop.

3. Select an action $a \in A(s)$.

4. Execute action $a$.

/* As a consequence, the agent receives reward $r(s,a)$ and is in state $succ(s,a)$.
Increment the number of steps taken, i.e. set $t := t + 1$. */

5. Set $Q(s,a) := r(s,a) + \gamma U(succ(s,a))$.

6. Go to 1.

where $U(s) := \max_{a \in A(s)} Q(s,a)$ at every point in time.

Figure 3: The Q-learning algorithm
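
The steps of Figure 3 can be sketched in a few lines. Figure 3 leaves the rewards, the initial Q-values, and the action selection rule open, so the sketch below has to fix them: it assumes a reward of $-1$ per action, zero-initialized Q-values, no discounting, and selection of the action with the largest Q-value (the first listed action on ties). The domain and all identifiers are likewise illustrative assumptions.

```python
GAMMA = 1.0  # assumption: no discounting

# Hypothetical domain: succ[s][a] is the successor state of action a in state s.
succ = {
    0: {"right": 1},
    1: {"back": 0, "right": 2},
    2: {"back": 0, "right": 3},
    3: {},
}
goals = {3}


def reward(s, a):
    """Assumed reward function: every action execution costs one."""
    return -1.0


def q_learning(start, q_init=0.0):
    """Run the loop of Figure 3 until a goal state is reached."""
    Q = {s: {a: q_init for a in succ[s]} for s in succ if s not in goals}

    def U(s):
        # U(s) := max_a Q(s, a); goal states have value 0 in this sketch.
        return 0.0 if s in goals else max(Q[s].values())

    s, trace = start, [start]                        # step 1
    while s not in goals:                            # step 2
        a = max(Q[s], key=Q[s].get)                  # step 3 (assumed greedy rule)
        s_next = succ[s][a]                          # step 4
        Q[s][a] = reward(s, a) + GAMMA * U(s_next)   # step 5
        s = s_next                                   # step 6
        trace.append(s)
    return Q, trace


Q, trace = q_learning(start=0)
print(trace, len(trace) - 1)
```

With these particular choices the agent first tries the "back" actions, learns that they are worse, and reaches the goal after eight action executions. The action selection step never looks up `succ`; only the execution step does, standing in for the environment.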


The  action  selection  step  can  use  the  Q-values,  but  does  not  need  to.  The  actual 
selection  strategy  is  left  open:  It  could,  for  example,  select  an  action  randomly,  select 
the  action  that  it  has  executed  the  least  number  of  times,  or  select  the  action  with  the 
largest  Q-value.  Exploration  is  termed  undirected  [33]  if  it  uses  only  the  Q-values 
(or  no  information  at  all),  otherwise  it  is  termed  directed.  The  basic  Q-learning 
algorithms  studied  by  us  do  not  maintain  other  local  information  than  the  Q-values 
and  are  therefore  undirected. 

Once the selected action $a$ has been executed in state $s$ and the immediate reward $r(s,a)$
has been received, the agent temporarily has access to the Q-values of its former and
its new state at the same time, and the value update step adjusts $Q(s,a)$. The 1-step
look-ahead value $r(s,a) + \gamma U(succ(s,a))$ is more accurate than, and therefore replaces,
$Q(s,a)$. The value update step can also adjust other information local to the former
state if needed.

The Q-learning algorithm is memoryless. We call an on-line search algorithm
memoryless [14] if it cannot remember information other than what it has stored in the
states. Its access to this information is restricted to information local to the current
state of the agent. After the execution of an action, the algorithm can only propagate
information from the new state of the agent to its former state. Since the algorithm
cannot propagate information in the other direction and has no internal memory, it
cannot easily accumulate information: Information that it had available in its former
state is inaccessible in its new state. The notion “memoryless” is motivated by the
observation that the size required for the internal memory of an on-line search
algorithm should not depend on the size of the state space. If one actually builds a robot,
then  one  can  give  it  only  an  internal  memory  of  finite  size.  If  this  robot  is  to  solve 
tasks  in  state  spaces  of  arbitrary  size,  then  the  size  of  the  internal  memory  needed  by 
the  algorithm  that  controls  the  robot  cannot  depend  on  n.  Thus,  only  algorithms  that 
1. Set $s$ := the current state.

2. If $s \in G$, then stop.

3. Select an action $a \in A(s)$.

4. Execute action $a$.

/* As a consequence, the agent receives reward $r(s,a)$ and is in state $succ(s,a)$.
Increment the number of steps taken, i.e. set $t := t + 1$. */

5. Set $U(s) := \max_{a \in A(s)} (r(s,a) + \gamma U(succ(s,a)))$.

6. Go to 1.


Figure 4: The value-iteration algorithm


merely  need  a  memory  of  constant  size  (i.e.  finite  state  machines)  can  work  unchanged 
in  arbitrarily  large  state  spaces.  A  memoryless  algorithm  is  an  extreme  example  of 
such  algorithms. 


2.2  Value-Iteration 

The 1-step value-iteration algorithm, in the following called value-iteration, is
similar to the 1-step Q-learning algorithm, see Figure 4. Like the Q-learning algorithm, it
is memoryless. The difference is that the value-iteration algorithm can access $r(s,a)$,
$U(succ(s,a))$ and other information of state $succ(s,a)$ for every action $a \in A(s)$ in
the current state $s$, whereas the Q-learning algorithm has to estimate them with the
Q-values (and other information local to the current state). This difference is
illustrated graphically in Figure 5. Thus, the action selection step of the value-iteration
algorithm (line 3) can always use the current value of $U(succ(s,a))$ to evaluate the
goodness of executing $a$ in $s$, even if this value has changed since the last
execution of $a$ in $s$. The expectation $Q(s,a)$, however, changes only when $a$ is executed
in $s$ and can therefore be out-dated. The value update step (line 5) becomes “Set
$U(s) := \max_{a \in A(s)} (r(s,a) + \gamma U(succ(s,a)))$” for value-iteration.
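
The loop of Figure 4 can be sketched in the same way. As before, the rewards, the initial values, and the action selection rule are not fixed by the figure; the sketch assumes a reward of $-1$ per action, zero-initialized U-values, no discounting, and selection of the action with the largest 1-step look-ahead value (the first listed action on ties). The domain and identifiers are illustrative assumptions.

```python
GAMMA = 1.0  # assumption: no discounting

# Hypothetical domain: succ[s][a] is the successor state of action a in state s.
succ = {
    0: {"right": 1},
    1: {"back": 0, "right": 2},
    2: {"back": 0, "right": 3},
    3: {},
}
goals = {3}


def reward(s, a):
    """Assumed reward function: every action execution costs one."""
    return -1.0


def value_iteration(start, u_init=0.0):
    """Run the loop of Figure 4 until a goal state is reached."""
    U = {s: (0.0 if s in goals else u_init) for s in succ}

    def lookahead(s, a):
        # Uses the action model: succ(s, a) is predicted before a is executed.
        return reward(s, a) + GAMMA * U[succ[s][a]]

    s, trace = start, [start]                                  # step 1
    while s not in goals:                                      # step 2
        a = max(succ[s], key=lambda act: lookahead(s, act))    # step 3 (assumed greedy rule)
        s_next = succ[s][a]                                    # step 4
        U[s] = max(lookahead(s, act) for act in succ[s])       # step 5
        s = s_next                                             # step 6
        trace.append(s)
    return U, trace


U, trace = value_iteration(start=0)
print(trace, U)
```

Because the look-ahead in step 3 reads the current U-values of the successor states, the agent in this example avoids the "back" actions without ever executing them and reaches the goal after three action executions.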

Value-iteration  shares  with  Q-learning  that  it  does  not  explicitly  infer  the  topology  of 
the  state  space  (i.e.  it  does  not  know  or  learn  a  map),  but  it  must  know  an  action 
model (i.e. be able to predict $succ(s,a)$ for all $a \in A(s)$ in the current state $s \in S$).9
Whereas  Q-learning  does  not  know  the  effect  of  an  action  before  it  has  executed  it  at 
least  once,  value-iteration  only  needs  to  enter  a  state  at  least  once  to  discover  all  of  its 

9The agent knows a map if it is able to predict $succ(s,a)$ for all $s \in S$ and $a \in A(s)$, no matter
which state it is in. The agent knows an action model (a “distributed map”) if it is able to predict
$succ(s,a)$ for all $a \in A(s)$ in its current state $s \in S$. It is not necessarily able to predict the outcomes
of actions that are executed in other states than its current state. Thus, knowing a map is more
powerful than knowing an action model.

Figure 5: Reinforcement learning algorithms, showing what the agent sees and thinks under (a) (1-step) Q-learning and (b) (1-step) value-iteration


successor  states.  Since  value-iteration  is  more  powerful  than  Q-learning,  we  expect  it 
to  have  a  smaller  complexity. 


3  Task  Representation 

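To represent the task of finding shortest paths as a reinforcement learning problem, we
have to specify $\bar{S}$, $\bar{s}_{start}$, $\bar{G}$, $\bar{A}(\bar{s})$ for $\bar{s} \in \bar{S}$, $\overline{succ}$, and $\bar{r}$. We set $\bar{S} := S$, $\bar{s}_{start} := s_{start}$,
$\bar{G} := G$, $\bar{A}(s) := A(s)$ for $s \in S$, and use the state transition function $\overline{succ} := succ$,
except that we let the lifetime of the agent in formula 1 end when it reaches a goal
state. Formally,

$$\begin{aligned}
\bar{S} &:= S \\
\bar{s}_{start} &:= s_{start} \\
\bar{G} &:= G \\
\bar{A}(s) &:= \begin{cases} \{id\} & \text{if } s \in G \\ A(s) & \text{otherwise} \end{cases} && \text{for all } s \in S \\
\overline{succ}(s,a) &:= \begin{cases} s & \text{if } a = id \\ succ(s,a) & \text{otherwise} \end{cases} && \text{for all } s \in S \text{ and } a \in \bar{A}(s) \\
\bar{r}(s,id) &:= 0 && \text{for all } s \in G
\end{aligned}$$

where $id$ is an identity action (i.e. it leaves the state unchanged).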
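
The construction can be written down mechanically. The sketch below assumes the dictionary encoding `succ[s][a]` for the state transition function; the function name, the action label `"id"`, and the example domain are hypothetical.

```python
ID = "id"  # hypothetical label for the identity action


def to_rl_problem(succ, goals):
    """Build the transition function and the fixed rewards of the reinforcement learning problem.

    In goal states, all original actions are replaced by the identity action,
    which leaves the state unchanged and has an immediate reward of zero.
    """
    succ_bar = {}
    for s, actions in succ.items():
        if s in goals:
            succ_bar[s] = {ID: s}
        else:
            succ_bar[s] = dict(actions)
    fixed_rewards = {(s, ID): 0 for s in goals}
    return succ_bar, fixed_rewards


# Hypothetical path planning domain in which the goal state 3 has an outgoing action.
succ = {
    0: {"right": 1},
    1: {"right": 2, "back": 0},
    2: {"right": 3, "back": 0},
    3: {"back": 0},
}
succ_bar, fixed_rewards = to_rl_problem(succ, goals={3})
print(succ_bar[3], fixed_rewards)
```

The rewards of the actions in non-goal states are deliberately not set here, since they depend on the chosen reward function.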

This leaves only the immediate rewards $r(s,a)$ for $s \in S \setminus G := \{s \in S : s \notin G\}$ and
$a \in A(s)$ unspecified. Every reward function $r$ can be chosen, as long as it has the
following property:

$$gd(s) < gd(s') \Leftrightarrow U_{opt}(s) > U_{opt}(s') \qquad \text{for all } s,s' \in S \qquad (3)$$
i.e. a state with a smaller goal distance has a larger optimal total reward and vice
versa.  This  guarantees  that  shorter  paths  to  a  goal  state  are  preferred  over  longer 
paths  which  in  turn  are  preferred  over  paths  that  do  not  lead  to  a  goal  state  (i.e.  let 
the  agent  cycle  in  non-goal  states).  If  condition  3  did  not  hold,  then  reinforcement 
learning  algorithms  would  prefer  longer  paths  to  a  goal  state  over  shorter  ones,  since 
they  maximize  the  total  reward  for  every  state. 

We  consider  two  possible  reward  functions  with  this  property,  both  shown  in  Figure  6. 
(We use the undirected edge $s \leftrightarrow s'$ as a shortcut for the two directed edges $s \rightarrow s'$
and $s \leftarrow s'$.)


3.1 Goal-Reward Representation


In the goal-reward representation, the agent is rewarded for entering a goal state,
but not rewarded or penalized otherwise. This representation has been used by [35],
[33], [31], and [24], among others.

$$r(s,a) = \begin{cases} 1 & \text{if } succ(s,a) \in G \\ 0 & \text{otherwise} \end{cases} \qquad \text{for } s \in S \setminus G \text{ and } a \in A(s)$$

The optimal total discounted reward of $s \in S \setminus G$ is $\gamma^{gd(s)-1}$. If no discounting is used,
then the optimal total reward is 1 for every $s \in S \setminus G$, independent of its goal distance,
since the state space is strongly connected. Thus, discounting is necessary so that
shorter goal distances equate with larger optimal total rewards, i.e. in order to satisfy
property 3.
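
The effect of discounting under the goal-reward representation can be checked numerically. The sketch below computes the optimal total rewards on a small domain by repeated sweeps, once with an assumed discount factor of 0.5 and once without discounting. The domain, the goal distances (read off by inspection), and all identifiers are illustrative assumptions.

```python
# Hypothetical domain; state 3 is the goal and only has the identity action.
succ = {
    0: {"right": 1},
    1: {"right": 2, "back": 0},
    2: {"right": 3, "back": 0},
    3: {"id": 3},
}
goals = {3}
goal_distance = {0: 3, 1: 2, 2: 1, 3: 0}  # read off the example by inspection


def goal_reward(s, a):
    """Reward 1 for entering a goal state from a non-goal state, 0 otherwise."""
    if s in goals:
        return 0.0
    return 1.0 if succ[s][a] in goals else 0.0


def optimal_values(gamma, sweeps=50):
    """Approximate the optimal total rewards by repeated sweeps, starting from zero."""
    U = {s: 0.0 for s in succ}
    for _ in range(sweeps):
        U = {
            s: max(goal_reward(s, a) + gamma * U[succ[s][a]] for a in succ[s])
            for s in succ
        }
    return U


discounted = optimal_values(gamma=0.5)
undiscounted = optimal_values(gamma=1.0)
print(discounted)
print(undiscounted)
```

With discounting, the values of the non-goal states are 0.25, 0.5, and 1, which orders them by goal distance. Without discounting, all three values are 1, so the states can no longer be told apart.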


3.2 Action-Penalty Representation

In the action-penalty representation, the agent is penalized for every action that
it executes. This representation has a more dense reward structure than the
goal-reward representation (i.e. the agent receives non-zero rewards more often) if goals are
relatively sparse. It has been used by [3], [2], and [13], among others.

$$r(s,a) = -1 \qquad \text{for } s \in S \setminus G \text{ and } a \in A(s)$$

The optimal total discounted reward of $s \in S$ is $(1 - \gamma^{gd(s)})/(\gamma - 1)$. Its optimal total
undiscounted reward is $-gd(s)$. Note that discounting can be used with the
action-penalty representation, but is not necessary to satisfy property 3. Therefore, the
action-penalty representation provides additional freedom when choosing the parameters
of the reinforcement learning algorithms. The Q-values and U-values are integers
if no discounting is used, otherwise they are reals. Integers have the advantage over
reals that they need less memory space and can be stored without loss in precision.
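
Both closed forms follow from formula 1 under the assumption that an optimal agent follows a shortest path, so that it executes exactly $gd(s)$ actions with reward $-1$ each before its lifetime ends:

```latex
\begin{align*}
U_{opt}(s) &= \sum_{t=0}^{gd(s)-1} \gamma^t \cdot (-1)
            = -\frac{1 - \gamma^{gd(s)}}{1 - \gamma}
            = \frac{1 - \gamma^{gd(s)}}{\gamma - 1}
            && \text{for } \gamma < 1, \\
U_{opt}(s) &= \sum_{t=0}^{gd(s)-1} (-1) = -gd(s)
            && \text{for } \gamma = 1.
\end{align*}
```

Both expressions decrease strictly as $gd(s)$ grows, which is why property 3 holds with and without discounting.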


3.3  The  Problem  of  Delayed  Rewards 

In the goal-reward representation, actions that enter a goal state have a positive immediate
reward. All other actions have no immediate reward or penalty at all. Thus, the
agent  receives  its  first  non-zero  reward  when  it  enters  a  goal  state  for  the  first  time.  In 
the  action-penalty  representation,  all  actions  (in  non-goal  states)  have  an  immediate 
cost  of  one.  Thus,  the  agent  always  receives  non-zero  rewards  no  matter  which  actions 
it  executes. 

The  problem  of  delayed  rewards  is  present  no  matter  which  of  the  two  representations 
is  used,  since  it  is  not  necessarily  optimal  to  always  execute  the  action  with  the  largest 
immediate  reward.  In  fact,  in  almost  all  states  all  actions  have  the  same  immediate 
reward (provided that goals are relatively sparse), namely 0 if goal-reward representation
is used or $-1$ if action-penalty representation is used, but usually not all actions
are equally preferable.

To summarize, the agent immediately receives non-zero rewards if action-penalty
representation is used but no immediate rewards if goal-reward representation is used.
However,  switching  from  goal-reward  representation  to  action-penalty  representation 
does  not  transform  the  reinforcement  learning  problem  from  one  with  delayed  rewards 
to  one  with  immediate  rewards  and,  thus,  does  not  make  the  credit-assignment  problem 
trivial  to  solve. 


4  Reaching  a  Goal  State  with  Q-Learning 

We  can  now  determine  the  complexity  of  reinforcement  learning  algorithms  for  the  path 
planning  tasks.  We  first  analyze  the  complexity  of  reaching  a  goal  state  for  the  first 
time.  The  agent  does  not  need  to  find  shortest  paths.  It  only  has  to  reach  a  goal  state 
from  the  start  state.  We  use  the  total  number  of  steps  that  the  agent  needs  to  solve 
the task in the worst case as a measure for the complexity of the on-line algorithms.
The  worst-case  complexity  of  reaching  a  goal  state  provides  a  lower  bound  on  the 
complexity  of  finding  all  shortest  paths,  since  this  cannot  be  done  without  knowing 
where  the  goal  states  are.  By  “worst-case  complexity”  we  mean  an  upper  bound  on 
the  number  of  steps  for  an  uninformed  algorithm  that  holds  for  all  possible  topologies 
of  the  state  space,  start  and  goal  states,  and  tie  breaking  rules  among  indistinguishable 
actions (i.e. actions that have the same Q-values). Clearly, in order to have a
worst-case complexity smaller than infinity, an initially uninformed search algorithm must
learn  something  about  the  effects  of  action  executions. 


4.1 Zero-Initialized Q-Learning with Goal-Reward Representation

Assume that a Q-learning algorithm operates on the goal-reward representation and is
zero-initialized, i.e. has no initial knowledge of the topology of the state space. Such
an algorithm is uninformed.


Definition 1 A Q-learning algorithm is initialized with $q \in \mathbb{R}$ (or, synonymously,
$q$-initialized), iff initially

$$Q(s,a) = \begin{cases} 0 & \text{if } s \in G \\ q & \text{otherwise} \end{cases} \qquad \text{for all } s \in S \text{ and } a \in A(s)$$


An  example  for  a  possible  behavior  of  the  agent  in  the  state  space  with  the  reward 
structure  from  Figure  6(a)  is  shown  in  Figure  7.  It  depicts  the  beginning  and  end  of 
a trace, i.e. the state sequence that the agent traverses. The U-values are shown in the
states, and the Q-values label the actions.

During the search for a goal state, all Q-values remain zero. The Q-value of the
action that leads the agent to a goal state is the first Q-value that changes. For all
other actions, no information about the topology of the state space is remembered by
the agent, i.e. it does not remember the effects of actions. In fact, it does not even
remember which actions it has already explored. As a consequence, the action selection
step has no information on which to base the decision which action to execute next if
it utilizes only the Q-values (i.e. if it performs undirected exploration). We assume
that the agent selects actions randomly, i.e. according to a uniform distribution.10
Therefore, the agent performs a random walk. The same behavior can be achieved
without maintaining Q-values or any other information, simply by repeatedly selecting
one of the available actions at random and executing it.

¹⁰ If it had a systematic bias, for example if it always chose the smallest action according
to some given ordering, it would not necessarily reach a goal state and could therefore cycle
in the state space forever.

(a) state space

Figure 8: A domain (“reset state space”) for which a random walk needs $3 \cdot 2^{n-2} - 2$
steps on average to reach the goal state (for $n \geq 2$)

A random walk in a strongly connected, finite state space reaches a goal state eventually
with probability one. The number of steps needed can exceed every given bound (with
a probability close to zero), so we use the largest average number of steps over all
possible  topologies  of  the  state  space,  start  and  goal  states,  rather  than  the  worst-case 
complexity.  The  number  of  steps  needed  on  average  to  reach  a  goal  state  from  state  s 
equals  the  absorption  time  of  s  in  the  Markov  chain  that  corresponds  to  the  random 
walk.  (See  Figure  8(b)  for  an  example  whose  transitions  are  labeled  with  the  transition 
probabilities.) For every $s \in S$ we introduce a variable $x_s \in \mathbb{R}$ that represents the
average number of steps needed until a goal state is reached if the agent starts in s.
These values can be calculated by solving the following set of linear equations:

\[
x_s = \begin{cases}
0 & \text{if } s \in G \\
1 + \dfrac{1}{|A(s)|} \sum_{a \in A(s)} x_{succ(s,a)} & \text{otherwise}
\end{cases}
\qquad \text{for all } s \in S.
\]

$x_{s_{start}}$ can scale exponentially with n, the number of states. Consider for example
the  reset  state  space  shown  in  Figure  8.  A  “reset”  state  space  is  one  in  which  all 
states  (except  for  the  start  state)  have  an  action  that  leads  back  to  the  start  state.  It 
corresponds  for  example  to  the  task  of  stacking  n  blocks  if  the  agent  can  either  stack 
another  block  or  scramble  the  stack  in  every  state. 
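As a sanity check on the linear equations above, the following sketch builds them for a reset state space and solves them exactly. It assumes the layout described in the text (state 1 is the start state and has a single forward action, every other non-goal state has one reset and one forward action, state n is the goal state). The function names `reset_space` and `expected_steps` are illustrative and do not appear in the report.

```python
from fractions import Fraction

def reset_space(n):
    """Successor lists of an assumed reset state space with states 1..n (goal n)."""
    succ = {1: [2]}                      # the start state can only move forward
    for s in range(2, n):
        succ[s] = [1, s + 1]             # reset to the start state, or move forward
    succ[n] = []                         # goal state: the walk stops here
    return succ

def expected_steps(succ, goal):
    """Solve x_s = 1 + mean over actions of x_succ(s,a), with x_goal = 0, exactly."""
    states = [s for s in succ if s != goal]
    idx = {s: i for i, s in enumerate(states)}
    m = len(states)
    # Row for s encodes  x_s - (1/|A(s)|) * sum_a x_succ(s,a) = 1  as [coefficients | rhs].
    rows = []
    for s in states:
        row = [Fraction(0)] * (m + 1)
        row[idx[s]] += 1
        for t in succ[s]:
            if t != goal:
                row[idx[t]] -= Fraction(1, len(succ[s]))
        row[m] = Fraction(1)
        rows.append(row)
    # Gauss-Jordan elimination over exact fractions.
    for col in range(m):
        pivot = next(r for r in range(col, m) if rows[r][col] != 0)
        rows[col], rows[pivot] = rows[pivot], rows[col]
        rows[col] = [v / rows[col][col] for v in rows[col]]
        for r in range(m):
            if r != col and rows[r][col] != 0:
                factor = rows[r][col]
                rows[r] = [a - factor * b for a, b in zip(rows[r], rows[col])]
    return {s: rows[idx[s]][m] for s in states}

for n in range(2, 9):
    x_start = expected_steps(reset_space(n), goal=n)[1]
    print(n, x_start, 3 * 2 ** (n - 2) - 2)   # solved value next to the closed form
```

The two printed columns agree for every n tried, which illustrates the exponential growth of the average number of steps in the reset state space.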

Theorem  1  (Whitehead  [37])  The  expected  number  of  steps  that  a  zero-initialized 
Q-learning  algorithm  with  goal-reward  representation  needs  in  order  to  reach  a  goal 
state  and  terminate  can  be  exponential  in  n. 


Figure 9: A domain for which a random walk needs $2^{n+1} - 3n - 1$ steps on average to
reach the goal state (for $n \geq 1$)

[35] made this observation for state spaces that have the following property: In every
state (except for the border states), the probability of choosing an action that leads
away from the (only) goal state is larger than the probability of choosing an action that
leads closer to the goal state. Consider for example the version of a one-dimensional
gridworld shown in Figure 9, which is similar to one in [33]. In every state (but the
border states) the agent can execute three actions: one leads to the right, the other
two are identical and lead to the left. If an action is chosen randomly, chances are 2/3
that the agent goes to the left (i.e. goes away from the goal state) and only 1/3 that
it goes to the right (i.e. approaches the goal state).
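The closed form in the caption of Figure 9 can be checked with a short calculation. The derivation below is a sketch under an assumed reading of the figure: the states are numbered 1, ..., n from left to right, the start state 1 has a single action that leads to the right, and the goal state is n. The differences $d_i$ are an auxiliary quantity introduced here.

```latex
\begin{align*}
x_n &= 0, \qquad x_1 = 1 + x_2, \qquad
x_i = 1 + \tfrac{1}{3}\, x_{i+1} + \tfrac{2}{3}\, x_{i-1} \quad (1 < i < n) \\
d_i &:= x_i - x_{i+1}
  \;\Rightarrow\; d_1 = 1, \quad d_i = 3 + 2\, d_{i-1}
  \;\Rightarrow\; d_i = 2^{i+1} - 3 \\
x_1 &= \sum_{i=1}^{n-1} d_i = \left( 2^{n+1} - 4 \right) - 3\,(n-1) = 2^{n+1} - 3n - 1
\end{align*}
```

For n = 3, for instance, this gives $x_1 = 1 + 5 = 6 = 2^4 - 9 - 1$.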

This observation motivated [35] to explore cooperative reinforcement learning algorithms
in order to decrease the complexity. [32] showed that even non-cooperative reinforcement
learning algorithms have polynomial worst-case complexity if reinforcement
learning is augmented with a directed exploration mechanism that he calls “counter-based
Q-learning.” We will show that one does not need to augment Q-learning. It
is tractable if one uses either the action-penalty representation or different initial
Q-values. This confirms experimental observations of [11], [36], [19], and [24], who mention
an improvement in performance during their experiments when using the action-penalty
representation or initializing the Q-values high instead of using the goal-reward
representation and initializing the Q-values with zero.


4.2  Using  a  Different  Task  Representation 

Assume now that we are still using a zero-initialized Q-learning algorithm, but let it
operate on the action-penalty representation. Although the algorithm is still uninformed,
the Q-values change immediately, starting with the first action execution, since the
reward structure is dense. In this way, the agent remembers the effects of previous
action executions. An example for a possible behavior of the agent in the state space
with the reward structure from Figure 6(b) is shown in Figure 10. The U-values are
shown in the states, and the Q-values label the actions.


4.2.1  Complexity  Analysis 

While  we  state  the  following  definitions,  lemmas,  and  theorems  for  the  case  in  which 
no  discounting  is  used,  they  can  easily  be  adapted  to  the  discounted  case  as  outlined 
later in this chapter.


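Definition 2 Q-values are consistent iff

\begin{align*}
0 \leq Q(s,a) \leq 0 & \quad \text{for all } s \in G \text{ and } a \in A(s) \\
-1 + U(succ(s,a)) \leq Q(s,a) \leq 0 & \quad \text{for all } s \in S \setminus G \text{ and } a \in A(s)
\end{align*}

Zero-initialized Q-values are consistent.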

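Definition 3 Q-values are admissible iff

\begin{align*}
0 \leq Q(s,a) \leq 0 & \quad \text{for all } s \in G \text{ and } a \in A(s) \\
-1 - gd(succ(s,a)) \leq Q(s,a) \leq 0 & \quad \text{for all } s \in S \setminus G \text{ and } a \in A(s)
\end{align*}

Consistent Q-values are admissible.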

To understand our choice of terminology, note the following relationship:

\[
Q^{opt}(s,a) = \begin{cases}
0 & \text{for all } s \in G \text{ and } a \in A(s) \\
-1 + U^{opt}(succ(s,a)) = -1 - gd(succ(s,a)) & \text{for all } s \in S \setminus G \text{ and } a \in A(s)
\end{cases}
\]

since $U^{opt}(s) = -gd(s)$ for all $s \in S$. Thus, Q-values are admissible iff
$Q^{opt}(s,a) \leq Q(s,a) \leq 0$ for all $s \in S$ and $a \in A(s)$. This parallels the notion of
admissibility that is used for heuristic estimates in connection with A*-search (see for
example [20] or [22]). Furthermore, if a heuristic h for the goal distance is known that is
admissible for A*-search, then the following Q-values are admissible as well:

\[
Q(s,a) = \begin{cases}
0 & \text{for all } s \in G \text{ and } a \in A(s) \\
-1 - h(succ(s,a)) & \text{for all } s \in S \setminus G \text{ and } a \in A(s)
\end{cases}
\]

Admissible  heuristics  are  known  for  many  search  problems  in  Artificial  Intelligence,  for 
example  for  path  planning  problems  in  a  gridworld  (or  more  general:  on  a  topological 
map)  or  solving  the  8-puzzle. 
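To make the construction concrete, the following sketch initializes Q-values from the Manhattan distance in a small gridworld and checks them against Definition 3. The grid size, the obstacle cells, the goal cell, and all identifiers are hypothetical choices made for this illustration only.

```python
from collections import deque

WIDTH, HEIGHT = 4, 3
WALLS = {(1, 0), (1, 1)}                 # hypothetical obstacle cells
GOAL = (3, 0)
MOVES = [(1, 0), (-1, 0), (0, 1), (0, -1)]

def free(cell):
    x, y = cell
    return 0 <= x < WIDTH and 0 <= y < HEIGHT and cell not in WALLS

STATES = [(x, y) for x in range(WIDTH) for y in range(HEIGHT) if free((x, y))]

def successors(s):
    """One successor per action; an action exists only if it leads to a free cell."""
    return [(s[0] + dx, s[1] + dy) for dx, dy in MOVES if free((s[0] + dx, s[1] + dy))]

def manhattan(s):
    """A heuristic h for the goal distance that is admissible for A*-search here."""
    return abs(s[0] - GOAL[0]) + abs(s[1] - GOAL[1])

def goal_distances():
    """True goal distances gd(s), by breadth-first search starting at the goal."""
    gd, queue = {GOAL: 0}, deque([GOAL])
    while queue:
        s = queue.popleft()
        for t in successors(s):          # moves are reversible: successors = predecessors
            if t not in gd:
                gd[t] = gd[s] + 1
                queue.append(t)
    return gd

# Q(s,a) = 0 in the goal state and -1 - h(succ(s,a)) otherwise, keyed by (state, successor).
Q = {(s, t): 0 if s == GOAL else -1 - manhattan(t) for s in STATES for t in successors(s)}

gd = goal_distances()
admissible = all(-1 - gd[t] <= q <= 0 for (s, t), q in Q.items() if s != GOAL)
print("admissible:", admissible)
```

Because the Manhattan distance never overestimates the goal distance, every value lies between $-1 - gd(succ(s,a))$ and 0, even though the obstacles make the true goal distances larger than the heuristic suggests.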


Definition 4 A Q-learning algorithm is admissible iff it uses the action-penalty
representation, its action selection step is “$a := \arg\max_{a' \in A(s)} Q(s,a')$,” and either


• its initial Q-values are consistent, and its value update step is “Set
$Q(s,a) := -1 + U(succ(s,a))$,”¹¹ or

• its initial Q-values are admissible, and its value update step is “Set
$Q(s,a) := \min(Q(s,a), -1 + U(succ(s,a)))$.”

If  a  Q-learning  algorithm  is  admissible,  then  consistent  (admissible)  Q-values  remain 
consistent  (admissible)  after  every  step  of  the  agent  and  are  monotonically  decreasing. 

The action selection step of an admissible Q-learning algorithm always executes the
action with the largest Q-value. This strategy avoids the exploration-exploitation
conflict, since it always exploits (i.e. executes the action that currently seems to be
best) but at the same time explores sufficiently often (i.e. executes actions that it
has never executed before). It performs undirected exploration, since it uses only the
Q-values for action selection and no other information.

Call $PG := \{s \in S : U(s) = 0\} \supseteq G$ the set of potential goal states. If the Q-learning
algorithm is zero-initialized, then PG is always the set of states in which the agent
has not yet explored all of the actions. We call an action explored iff the agent has
executed it at least once. If Q-values are consistent, then

\[
-1 - \min_{s' \in PG} d(succ(s,a), s') \leq Q(s,a) \leq 0
\qquad \text{for all } s \in S \setminus G \text{ and } a \in A(s).
\]

The action selection step can then be interpreted as using Q(s,a) to approximate
$-1 - \min_{s' \in PG} d(succ(s,a), s')$ and tending (sometimes unsuccessfully) to direct the
agent from the current state to the closest potential goal state with as few steps as
possible and make it then take an unexplored action.

¹¹ i.e. “Set $Q(s,a) := r(s,a) + \gamma U(succ(s,a))$,” where $r(s,a) = -1$ and $\gamma = 1$.

It is easy to see that an admissible Q-learning algorithm eventually reaches a goal
state.  The  argument  parallels  a  similar  argument  for  RTA*-type  algorithms  [17,  25] 
and  is  by  contradiction:  If  the  agent  did  not  reach  a  goal  state  eventually,  it  would  run 
around  in  a  cycle  that  would  not  include  a  goal  state.  For  every  cycle  that  the  agent 
completed,  the  largest  Q-value  of  the  actions  executed  in  the  cycle  would  decrease  by 
at  least  one.  Eventually,  the  values  of  all  actions  executed  in  the  cycle  would  drop 
below every bound. In particular, they would drop below the Q-value of an action that,
when  executed,  would  make  the  agent  leave  the  cycle.  Such  an  action  exists,  since  the 
state  space  is  strongly  connected  and  thus  a  goal  state  can  be  reached  from  within  the 
cycle.  Then,  the  agent  would  leave  the  cycle,  which  is  a  contradiction. 

Now that we know that the algorithm terminates (which also means that the algorithm
is correct, i.e. reaches a goal state), we are interested in its complexity. Lemma 1
contains the central invariant for all proofs. It states that the number of steps executed
so far is always bounded by an expression that depends only on the initial and current
Q-values and, moreover, “that the sum of all Q-values decreases (on average) by one
for every step taken” (this paraphrase is grossly simplified). A time superscript of t
in Lemmas 1 and 2 refers to the values of the variables immediately before the action
execution step, i.e. line 4, of step t.

Lemma 1 For all steps $t \in \mathbb{N}_0$ (until termination) of an undiscounted, admissible
Q-learning algorithm, it holds that

\[
U^t(s^t) + \sum_{s \in S} \sum_{a \in A(s)} Q^0(s,a) - t
\;\geq\; \sum_{s \in S} \sum_{a \in A(s)} Q^t(s,a) + U^0(s^0) - loop^t
\]

and

\[
loop^t \;\leq\; \sum_{s \in S} \sum_{a \in A(s)} Q^0(s,a) - \sum_{s \in S} \sum_{a \in A(s)} Q^t(s,a),
\]

where $loop^t := |\{t' \in \{0, \ldots, t-1\} : s^{t'} = s^{t'+1}\}|$ (the number of identity actions, i.e.
actions that do not change the state, executed before t).

Lemma 2 An undiscounted, admissible Q-learning algorithm reaches a goal state and
terminates after at most

\[
2 \sum_{s \in S \setminus G} \sum_{a \in A(s)} \left( Q^0(s,a) + gd(succ(s,a)) + 1 \right) - U^0(s^0)
\]

steps.

Theorem 2 An admissible Q-learning algorithm reaches a goal state and terminates
after at most O(ed) steps.

Lemma 2 utilizes the invariant and the fact that each of the e different Q-values is
bounded by an expression that depends only on the goal distances to derive a bound
on t. Since “the sum of all Q-values decreases (on average) by one for every step
taken” according to the invariant, but is bounded from below, the algorithm must
terminate. Since $gd(s) \leq d$ for all $s \in S$, the result from Theorem 2 follows directly:
admissible initial Q-values satisfy $Q^0(s,a) \leq 0$ and $-U^0(s^0) \leq d + 1$, so the bound of
Lemma 2 is at most $2e(d + 1) + d + 1$. $O(ed) \leq O(en)$, since $d \leq n - 1$ in every strongly
connected state space. (Remember that n denotes the number of states, e the total number
of actions, and d the depth of the state space.) Thus, an admissible Q-learning algorithm
reaches a goal state and terminates after at most O(en) steps. This worst-case performance
provides, of course, an upper bound on the average-case performance of the algorithm.
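The bound of Lemma 2 can be compared with an actual run. The sketch below implements an undiscounted, zero-initialized Q-learning algorithm with action-penalty representation (greedy action selection and the update $Q(s,a) := -1 + U(succ(s,a))$) and evaluates the bound for the same domain. The six-state reset state space, the rule that the lowest action index wins ties, and all function names are assumptions made for this illustration.

```python
from collections import deque

def reset_space(n):
    """Assumed reset state space: states 1..n, start state 1, goal state n."""
    succ = {1: [2]}
    for s in range(2, n):
        succ[s] = [1, s + 1]             # action 0 resets, action 1 moves forward
    succ[n] = []
    return succ

def q_learning(succ, start, goal):
    """Zero-initialized Q-learning with action-penalty representation, no discounting."""
    Q = {s: [0] * len(actions) for s, actions in succ.items()}
    U = lambda s: 0 if s == goal else max(Q[s])
    s, steps = start, 0
    while s != goal:
        a = max(range(len(Q[s])), key=lambda i: Q[s][i])   # greedy; first index wins ties
        nxt = succ[s][a]                 # action execution step
        Q[s][a] = -1 + U(nxt)            # value update step
        s, steps = nxt, steps + 1
    return steps, Q

def goal_distances(succ, goal):
    """gd(s) for every state, by breadth-first search on the reversed graph."""
    pred = {s: [] for s in succ}
    for s, targets in succ.items():
        for t in targets:
            pred[t].append(s)
    gd, queue = {goal: 0}, deque([goal])
    while queue:
        t = queue.popleft()
        for s in pred[t]:
            if s not in gd:
                gd[s] = gd[t] + 1
                queue.append(s)
    return gd

def lemma2_bound(succ, goal, q0=0, u0_start=0):
    """2 * sum over non-goal (s,a) of (Q0(s,a) + gd(succ(s,a)) + 1) - U0(start)."""
    gd = goal_distances(succ, goal)
    total = sum(q0 + gd[t] + 1 for s, targets in succ.items() if s != goal for t in targets)
    return 2 * total - u0_start

domain = reset_space(6)
steps, Q = q_learning(domain, start=1, goal=6)
bound = lemma2_bound(domain, goal=6)
print(steps, bound)
```

On this domain the run takes 21 steps against a bound of 78, and the Q-values sum to -21 at termination, so the first inequality of Lemma 1 holds with equality here.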

Theorem 2 provides an upper bound on the complexity of every admissible Q-learning
algorithm, including a zero-initialized algorithm. To demonstrate that O(en) is a tight
bound for a zero-initialized Q-learning algorithm, we show that it is also a lower bound.
Lower bounds can be proved by example, i.e. by showing a domain for which the
Q-learning algorithm can need this many steps to reach a goal state. Such a domain is
depicted in Figure 11. Thus, the worst-case complexity of reaching a goal state with
admissible, zero-initialized Q-learning is tight at O(en).

4.2.2  Comparison  of  Q-Learning  to  other  On-Line  Search  Algorithms 

We  define  an  (initially)  uninformed  on-line  search  algorithm  to  be  an  algorithm 
that  does  not  know  the  effect  of  an  action  (i.e.  does  not  know  which  successor  state 
the  action  leads  to)  before  it  has  executed  it  at  least  once.  One  example  of  such  an 
algorithm  is  zero-initialized  Q-learning. 

A highly reactive algorithm, such as Q-learning, often executes actions that are
suboptimal (when judged according to the knowledge that the agent could have acquired
if  it  had  memorized  all  of  its  experiences).  For  example,  the  agent  can  move  around 
for  a  long  time  in  parts  of  the  state  space  that  it  has  already  explored,  thus  neither 
exploring  unknown  parts  of  the  state  space  nor  having  a  chance  to  find  a  goal.  This 
can  be  avoided  when  planning  further  into  the  future:  the  agent  learns  a  model  of 
the  state  space  (“world  model”,  which  is  in  our  case  a  map),  and  uses  it  to  predict  the 
effects  of  action  executions.  This  enables  the  agent  to  make  a  more  informed  decision 
about  which  action  to  execute  next. 

The DYNA architecture [30] [31], for example, implements look-ahead planning in the
framework of Q-learning. Actions are executed in the real world mainly to refine (and,
in non-stationary environments, to update) the model. The model is used to simulate
the execution of actions and thereby to create experiences that are indistinguishable
from the execution of real actions. This way, the real world and the model can
interchangeably be used to provide input for Q-learning. Using the model, the agent can
optimize its behavior according to its current knowledge without having to execute
actions in the real world. Also, the agent can simulate the execution of arbitrary (not
just local) actions at any time. Various researchers, for example [24] and [19], have
devised strategies for choosing which actions to simulate in order to speed up planning.

Figure 11: A domain for which an admissible, zero-initialized Q-learning algorithm (and
every other uninformed on-line search algorithm) can need at least $(e - n + 1)(n - 1)$
steps to reach the goal state (for n > 2 and e > n)

There is a trade-off: When learning a map and using it for planning, the agent has
to keep more information around and perform more computations between action
executions, which increases its deliberation time between action executions. However,
chances are that the agent needs fewer steps to reach a goal state. In the following, we
compare the worst-case complexity of the Q-learning algorithm, which does not learn a
map, to the worst-case complexity of any other uninformed search algorithm, e.g. one
that learns a map and uses it for planning.

Figure 11 shows that every uninformed on-line search algorithm has a worst-case
complexity of at least O(en). If ties are broken in favor of actions that lead to states with
smaller numbers, then every uninformed on-line search algorithm can traverse a
supersequence of the following state sequence: $e - n + 1$ times the sequence $1\,2\,3 \ldots n-1$,
and finally n. Basically, all of the O(e) actions in state $n - 1$ are executed once. All
of these lead the agent back to state 1 and therefore force it to execute O(n) explored
actions before it can explore a new action. Thus, the worst-case complexity is at least
O(en).
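The counting argument can be replayed for one concrete uninformed algorithm, namely zero-initialized Q-learning with action-penalty representation. The sketch below assumes a particular reading of Figure 11: a chain 1, 2, ..., n-1 in which every state has one action, except that state n-1 has e-n+1 actions, all but one of which lead back to state 1. The goal state's single action is counted in e but never executed. The domain builder and the rule that the lowest action index wins ties are illustrative assumptions. The run says nothing about other uninformed algorithms.

```python
def figure11_domain(n, e):
    """Assumed layout of the lower-bound domain with n states and e actions in total."""
    succ = {s: [s + 1] for s in range(1, n - 1)}   # chain states: one forward action each
    succ[n - 1] = [1] * (e - n) + [n]              # e-n actions back to state 1, one to the goal
    succ[n] = [n]                                  # the goal's action is never executed
    return succ

def q_learning_steps(succ, start, goal):
    """Number of steps of zero-initialized Q-learning with action-penalty representation."""
    Q = {s: [0] * len(actions) for s, actions in succ.items()}
    U = lambda s: 0 if s == goal else max(Q[s])
    s, steps = start, 0
    while s != goal:
        a = max(range(len(Q[s])), key=lambda i: Q[s][i])   # first index wins ties
        nxt = succ[s][a]
        Q[s][a] = -1 + U(nxt)
        s, steps = nxt, steps + 1
    return steps

for n, e in [(3, 4), (5, 9), (6, 20)]:
    print(n, e, q_learning_steps(figure11_domain(n, e), 1, n), (e - n + 1) * (n - 1))
```

With this tie-breaking rule every action in state n-1 that leads back to state 1 is tried before the goal action, so the step count matches $(e - n + 1)(n - 1)$ in each case.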

We  had  already  seen  that  one  could  decrease  the  complexity  of  Q-learning  dramatically 
(i.e.  in  some  cases  exponentially)  by  choosing  the  action-penalty  representation  over 
the  goal-reward  representation.  Now,  we  have  shown  that  every  uninformed  on-line 
search algorithm has at least the same worst-case complexity as a zero-initialized
Q-learning algorithm with action-penalty representation. This does not mean, however,
that  such  a  Q-learning  algorithm  is  the  best  possible  algorithm  for  reaching  a  goal  state. 
In  the  following,  we  show  that  there  exists  an  uninformed  on-line  search  algorithm  that 
strictly  dominates  an  admissible,  zero-initialized  Q-learning  algorithm.  We  say  that  an 
algorithm  X  strictly  dominates  an  algorithm  Y  if  X  always  performs  no  worse  (i.e. 
needs  no  more  steps)  than  Y  and  performs  strictly  better  in  at  least  one  case. 

Figure 12: A domain for which an admissible, zero-initialized Q-learning algorithm
can need at least $\frac{1}{16}n^3 - \frac{3}{16}n^2 - \frac{1}{16}n + \frac{3}{16}$ steps to reach the goal state, but
Qmap-learning needs only at most $\frac{3}{8}n^2 + \frac{3}{2}n - \frac{23}{8}$ steps (for odd n > 3). One part of
the state space is totally connected.

Consider an algorithm that maintains a map of the part of the state space that it has
explored so far. It executes unexplored actions in the same order as zero-initialized
Q-learning, but always chooses the shortest known path to the next unexplored action:
The agent uses its model of the world, i.e. the (partial) map and its knowledge of the
Q-values, to simulate its behavior under the Q-learning algorithm until it would execute
an unexplored action. Then, it uses the map to find the shortest known action sequence
that leads from its current state in the world to the state in which it can execute this
action, executes the action sequence and the unexplored action, and repeats the cycle.
We call this algorithm the Qmap-learning algorithm. By construction, it cannot
perform worse than zero-initialized Q-learning (no matter what the tie-breaking rule
is) if ties are broken in the same way. Thus, the worst-case complexity of Qmap-learning
over all tie-breaking rules cannot be worse than that of Q-learning. Consider, for
example, the state sequence that Q-learning traverses in a reset state space (shown
in Figure 8) of size n = 6 if ties are broken in favor of actions that lead to states
with smaller numbers: 1212312341234512123456. First, Q-learning finds out about the
effect of action a1 in state 1 and then about a2 in 2, a1 in 2, a2 in 3, a1 in 3, a2 in 4,
a1 in 4, a2 in 5, and a1 in 5, in this order. The Qmap-learning algorithm explores the
actions in the same order. However, after it has executed action a2 in state 5 for the
first time, it knows how to reach state 5 again faster than Q-learning: it goes from state
1 through states 2, 3, and 4, to state 5, whereas Q-learning goes through states 2, 1,
2, 3, and 4. Thus, the Qmap-learning algorithm traverses the following state sequence:
12123123412345123456, and is two steps faster than Q-learning. Figure 12 gives an
example of a domain for which the big-O worst-case complexities of the two algorithms
are different: There is a tie-breaking rule that causes Q-learning to need $O(n^3)$ steps
to reach the goal state, whereas Qmap-learning needs at most $O(n^2)$ steps no matter
how ties are broken. In short: Q-learning can require $O(n^3)$ steps to reach the goal
state, whereas Qmap-learning reaches the goal state with at most $O(n^2)$ steps.
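The two state sequences for the reset state space of size n = 6 can be reproduced with a small simulation. The sketch below is one possible reading of the Qmap-learning description, not the authors' implementation: it keeps a single set of Q-values, replays Q-learning on the learned map until an unexplored action would be chosen, and then walks a shortest known path to that action. The domain layout, the rule that the lowest action index (the action leading to the lower-numbered state) wins ties, and all identifiers are assumptions.

```python
from collections import deque

def reset_space(n):
    """Assumed reset state space: states 1..n, start state 1, goal state n.
    Actions are listed so that a lower index leads to a lower-numbered state."""
    succ = {1: [2]}
    for s in range(2, n):
        succ[s] = [1, s + 1]              # reset to the start state, or move forward
    succ[n] = []                          # no action is ever executed in the goal state
    return succ

def greedy(q_row):
    """Index of the largest Q-value; the first (lowest) index wins ties."""
    return max(range(len(q_row)), key=lambda i: q_row[i])

def q_learning(succ, start, goal):
    Q = {s: [0] * len(actions) for s, actions in succ.items()}
    U = lambda s: 0 if s == goal else max(Q[s])
    s, trace = start, [start]
    while s != goal:
        a = greedy(Q[s])
        nxt = succ[s][a]
        Q[s][a] = -1 + U(nxt)
        s = nxt
        trace.append(s)
    return trace

def shortest_known_path(known, src, dst):
    """States visited after src on a shortest path to dst that uses explored actions only."""
    parent, queue = {src: None}, deque([src])
    while queue:
        s = queue.popleft()
        for (u, _), w in known.items():
            if u == s and w not in parent:
                parent[w] = s
                queue.append(w)
    path = []
    while dst != src:
        path.append(dst)
        dst = parent[dst]
    return path[::-1]

def qmap_learning(succ, start, goal):
    Q = {s: [0] * len(actions) for s, actions in succ.items()}
    U = lambda s: 0 if s == goal else max(Q[s])
    known = {}                            # learned map: (state, action index) -> successor
    real, trace = start, [start]
    while real != goal:
        v = real                          # simulate Q-learning in the model ...
        while (v, greedy(Q[v])) in known: # ... until it would pick an unexplored action
            a = greedy(Q[v])
            nxt = known[(v, a)]
            Q[v][a] = -1 + U(nxt)
            v = nxt
        a = greedy(Q[v])
        trace.extend(shortest_known_path(known, real, v))
        real = succ[v][a]                 # execute the unexplored action in the world
        known[(v, a)] = real
        Q[v][a] = -1 + U(real)
        trace.append(real)
    return trace

domain = reset_space(6)
q_trace = "".join(map(str, q_learning(domain, 1, 6)))
qmap_trace = "".join(map(str, qmap_learning(domain, 1, 6)))
print(q_trace)
print(qmap_trace)
```

Under these assumptions the first sequence is 1212312341234512123456 and the second is 12123123412345123456, i.e. the map-based variant saves the detour through states 2 and 1 after the reset from state 5.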

The Qmap-learning algorithm is mainly of theoretical interest, because it demonstrates
that there exist algorithms that dominate Q-learning, and the domination proof is easy.
However, algorithms whose behavior resembles that of the Qmap-learning algorithm
are not only of theoretical interest:

One  idea  behind  the  DYNA  architecture  is  that  executing  actions  in  the  real  world  is 
slow  (and  expensive),  whereas  simulating  the  execution  of  actions  in  a  model  of  the 
world  is  fast  (and  inexpensive).  Therefore,  planning  should  exclusively  be  done  in  the 
model  if  possible.  Once  an  action  is  explored  in  the  real  world,  it  can  be  integrated  in 
the  model.  The  effect  of  an  action  does  not  change  over  time,  since  the  state  space  is 
stationary.  Consequently,  actions  should  only  be  executed  in  the  real  world 

•  to  learn  the  effect  of  the  executed  (unexplored)  action, 

•  to  get  the  agent  into  a  state  in  which  it  can  find  out  about  the  effect  of  an 
unexplored  action,  or 

•  to  get  the  agent  to  a  goal. 

If  executing  actions  in  the  real  world  and  simulating  action  executions  in  the  model  take 
approximately  the  same  time,  then  there  is  a  trade-off.  Simulating  actions  allows  the 
agent to utilize its current knowledge better, whereas executing actions in the real world
increases its knowledge and allows it to stumble across a goal. This trade-off has for
example  been  investigated  by  [15]  and  [16].  Using  the  model  provides  the  agent  with  the 
advantage  that  all  actions  can  be  simulated  at  any  time,  whereas  the  world  constrains  it 
to  execute  only  actions  in  its  current  state.  Reinforcement  learning  researchers  usually 
assume  that  the  agent  can  simulate  x  action  executions  for  every  action  execution  in 
the  real  world.  Then,  the  problem  arises  which  actions  to  simulate.  Reinforcement 
learning  researchers  have  proposed  real-time  schemes  for  planning  in  the  world  model 
of  a  DYNA  architecture  that  are  used  to  approximate  the  following  behavior  of  the 
agent:  “If  the  current  world  state  is  a  goal  state,  stop.  Otherwise,  go  to  the  closest 
state  with  an  unexplored  action,  execute  it,  and  repeat.”  This  way,  one  prevents  the 
agent  from  unnecessarily  executing  actions  that  it  has  already  explored,  which  is  also 
the objective of the Qmap-learning algorithm. Different researchers update the Q-values
in the model in different orders. For example, [23] and [9] study reinforcement learning
algorithms  that  have  a  look-ahead  larger  than  one  or  perform  best-first  search.  In 
contrast,  [19]  updates  those  Q-values  in  the  model  first  that  are  expected  to  change 
most.  Note  that  we  have  not  included  the  planning  time  in  the  complexity  measure. 
If  planning  time  is  not  negligible  compared  to  execution  time,  then  the  total  run-time 
(i.e.  the  sum  of  planning  and  execution  time)  can  increase  even  if  the  number  of  steps 
needed  (i.e.  the  execution  time)  decreases. 

Figure 13: A diagram showing the relationships between an admissible, zero-initialized
Q-learning algorithm, the Qmap-learning algorithm, and uninformed on-line search
algorithms in general (data points for domains other than the ones shown in Figures 12
and 15 are fictitious). Legend: the marker symbol corresponds to a data point proved to hold.

The relationships between an admissible, zero-initialized Q-learning algorithm, the
Qmap-learning algorithm, and uninformed on-line search algorithms in general are
summarized in Figure 13 (for domains that have no duplicate actions, see Chapter 4.4.1).
They show that it can be misleading to focus only on the worst-case complexity of
an on-line search algorithm over all domains. We have demonstrated that there exists
an uninformed on-line search algorithm, namely Q-learning, that reaches a goal state
in $O(n^3)$ steps no matter what the domain is. There are domains in which no
uninformed on-line search algorithm can do better. However, there exists an uninformed
on-line search algorithm, namely Qmap-learning, that always performs no worse than
Q-learning, but reduces the number of action executions by more than a constant
factor in at least one domain. To summarize, although every uninformed on-line search
algorithm has at least the same big-O worst-case complexity as an admissible,
zero-initialized Q-learning algorithm, learning a map of the state space (and subsequently
using it for planning) can decrease the big-O worst-case complexity for some (but not
all) domains.

4.2.3  Comparison  of  Q-Learning  to  Off-Line  Search  Algorithms 

The ratio of the worst-case effort spent by a given on-line algorithm to solve a given
problem and the worst-case effort spent by the best informed off-line algorithm to solve
the same problem is called the competitive ratio of the on-line algorithm. An on-line
algorithm is called competitive if its competitive ratio is bounded from above by a
constant (i.e. the upper bound is independent of the problem size) [29]. Figure 14
shows, as expected, that no uninformed on-line search algorithm is competitive: they
need O(en) steps in the worst case, but an off-line algorithm that knows the topology
of the state space can reach the goal state in only one step.¹² On-line search algorithms
with a look-ahead of one step (such as Q-learning), which reduce deliberation between
action executions to a minimum, and off-line search algorithms represent extreme points
on the deliberation-action scale, with on-line search algorithms that have a limited
look-ahead larger than one in between. The competitive ratio of O(en) demonstrates that
the number of steps executed differs by a factor of at most O(en) between the reactive
extreme of the deliberation-action scale and the deliberative extreme.

Figure 14: A domain for which an admissible, zero-initialized Q-learning algorithm (and
every other uninformed on-line search algorithm) can need at least $(e - n)(n - 2) + 1$
steps to reach the goal state, but an off-line search algorithm that knows the topology
of the state space needs only one step to reach the goal state (for n > 3 and e > n + 1)

¹² One could argue that it would be more appropriate to use the maximum of the ratio of the
average-case effort spent by the on-line algorithm and the average-case effort spent by the
best totally informed off-line algorithm. For the path planning tasks, the average-case
complexity of the off-line algorithm equals its worst-case complexity, since action outcomes
are deterministic. The complexity of the on-line algorithm, to the contrary, depends on how
ties are broken, which is a random process. Therefore, its average-case complexity can be
smaller than its worst-case complexity. However, Figure 14 also shows that no uninformed
on-line search algorithm is competitive, even under this relaxed definition of competitiveness.
We continue to use our original definition, since we are for now more interested in the
worst-case complexity of the algorithms than in their average-case performance.


4.2.4 Discounting

All properties of an undiscounted, admissible Q-learning algorithm can immediately
be transferred to the case in which discounting is used, since there is a strictly
monotonically increasing bijection between the Q-values of the former algorithm after
step t and the Q-values of the latter algorithm after the same step, namely: a
Q-value of $Q_1^t(s,a) = q \in -\mathbb{N}_0$ in the undiscounted case corresponds to a Q-value of
$Q_2^t(s,a) = (1 - \gamma^{-q})/(\gamma - 1)$ in the discounted case. Since both algorithms always
execute the action with the largest Q-value, they always choose the same action for
execution (if ties are broken in the same way). Thus, they behave identically.
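The correspondence between undiscounted and discounted Q-values stated in Section 4.2.4 can be checked numerically. The sketch below uses exact fractions and an arbitrarily chosen discount factor of 9/10. It assumes that the discounted value update is "Set Q(s,a) := -1 + γU(succ(s,a))", i.e. the update of footnote 11 with r(s,a) = -1 and γ < 1. The name `to_discounted` is illustrative.

```python
from fractions import Fraction

GAMMA = Fraction(9, 10)                  # any discount factor with 0 < gamma < 1

def to_discounted(q):
    """Map an undiscounted Q-value q (a non-positive integer) to its discounted counterpart."""
    return (1 - GAMMA ** (-q)) / (GAMMA - 1)

values = [to_discounted(q) for q in range(0, -8, -1)]

# Strictly increasing: a smaller undiscounted value maps to a smaller discounted value.
monotone = all(lo < hi for hi, lo in zip(values, values[1:]))

# The undiscounted update q -> -1 + q corresponds to the discounted update v -> -1 + gamma * v.
commutes = all(to_discounted(-1 + q) == -1 + GAMMA * to_discounted(q) for q in range(0, -8, -1))
print(monotone, commutes)
```

Because the mapping preserves the order of the Q-values and commutes with the value update, greedy action selection picks the same action in both algorithms at every step, which is the argument made in the text.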
4.3 Using Different Initial Q-Values


We now analyze Q-learning algorithms that operate on the goal-reward representation,
but are one-initialized. A similar initialization has been used before, for instance in
experiments conducted by [11]. It assumes no prior knowledge of the topology of the
state space if one makes the reasonable assumption that the value update step knows
after the execution of an action whether the new state of the agent is a goal state or
not. Then, the Q-values of actions in non-goal states can be initialized differently from
the Q-values of actions in goal states. We assume (again) that the algorithm uses an
action selection step that executes the action with the largest Q-value.

If the value update step knows whether the new state of the agent is a goal state, one
can set the initial Q-values of actions in non-goal states to $-1$ and the ones of actions in
goal states to 0. Assigning $-1$ to non-goal states reflects the fact that the agent knows
that the state is a non-goal state once it is in the state and therefore can conclude that
it takes at least one action execution to reach a goal state. Such an undiscounted,
(minus one)-initialized Q-learning algorithm with action-penalty representation behaves
identically to a discounted one-initialized Q-learning algorithm with goal-reward
representation, since there is a strictly monotonically increasing bijection between the
Q-values of the former algorithm after step t and the Q-values of the latter algorithm
after the same step, namely: A Q-value of $Q_1^t(s,a) = q \in \mathbb{Z}^-$ ($Q_1^t(s,a) = 0$) for a
(minus one)-initialized Q-learning algorithm with action-penalty representation
corresponds to a Q-value of $Q_2^t(s,a) = \gamma^{-q-1}$ ($Q_2^t(s,a) = 0$) for a one-initialized Q-learning
algorithm with goal-reward representation.¹³ (As argued in Chapter 3.1, discounting is
necessary if goal-reward representation is used.) Since both algorithms always execute
the action with the largest Q-value, they always choose the same action for execution
(if ties are broken in the same way). Thus, they behave identically. Since (minus
one)-initialized Q-values are consistent, Theorem 2 applies to the former algorithm. This
leads to the following conclusion.

Theorem 3 A discounted, one-initialized Q-learning algorithm with goal-reward
representation reaches a goal state and terminates after at most O(ed) steps if its
action selection step is “a := argmax_{a' ∈ A(s)} Q(s,a')” and its value update step is
“Set Q(s,a) := r(s,a) + γ U(succ(s,a)).”

The other results and remarks from the previous chapter can be transferred as well.
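The claimed equivalence can be checked mechanically on a small example. The following sketch is an illustration only and not part of the original analysis: the four-state space `succ`, the helper `q_learning`, the discount factor γ = 1/2, the rule that ties go to the lowest-numbered action, and the correspondence q ↦ γ^(-q-1) between the two sets of Q-values are all assumptions made for this example.

```python
from fractions import Fraction

def q_learning(succ, goals, start, init, reward, gamma, max_steps=10_000):
    """Greedy 1-step Q-learning; returns the visited states and the final Q-values.

    succ[s][a] is the successor of action a in state s (hypothetical encoding).
    Actions in goal states start at 0, all other actions at `init`.
    """
    q = {s: [Fraction(0 if s in goals else init)] * len(outs)
         for s, outs in succ.items()}
    s, path = start, [start]
    while s not in goals:
        if len(path) > max_steps:
            raise RuntimeError("step budget exhausted")
        # Action selection: largest Q-value, ties go to the lowest index.
        a = max(range(len(q[s])), key=q[s].__getitem__)
        nxt = succ[s][a]
        # Value update: Q(s,a) := r(s,a) + gamma * U(succ(s,a)).
        q[s][a] = reward(nxt) + gamma * max(q[nxt])
        s = nxt
        path.append(s)
    return path, q

# Illustrative state space (an assumption); state 3 is the only goal state.
succ = {0: [0, 1], 1: [0, 2], 2: [1, 3, 0], 3: [3]}
goals = {3}
gamma = Fraction(1, 2)

# Undiscounted, (minus one)-initialized, action-penalty representation.
path_ap, q_ap = q_learning(succ, goals, 0, -1, lambda nxt: -1, 1)
# Discounted, one-initialized, goal-reward representation.
path_gr, q_gr = q_learning(succ, goals, 0, 1,
                           lambda nxt: 1 if nxt in goals else 0, gamma)

print(path_ap == path_gr)  # both agents visit the same sequence of states
```

Exact rational arithmetic (`Fraction`) is used so that ties between Q-values are detected reliably in the discounted case; with floating-point numbers and an arbitrary γ, rounding could break ties differently in the two runs.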

13 This argument generalizes to the bi-directional Q-learning algorithm that is stated later in this
report. If the task is only to reach a goal state and stop (i.e. the task that we discuss in this chapter),
then a zero-initialized Q-learning algorithm with action-penalty representation also behaves identically
to a discounted, one-initialized Q-learning algorithm with goal-reward representation. In this case, it
does not matter how the Q-values of actions in goal states are initialized.

4.4 Gridworlds and other Common Reinforcement Learning Domains

Gridworlds  and  other  domains  studied  in  the  context  of  reinforcement  learning  can 
have  special  properties  such  as  the  following. 


Definition 5 A state space has no identity actions iff, for all s ∈ S and a ∈ A(s),
succ(s,a) ≠ s, i.e. there are no actions that leave the state unchanged.


If the agent knows that an action is an identity action, it does not need to execute
it, since it is never optimal to execute an identity action. However, according to our
assumptions the agent does not know whether an action is an identity action.
For example, it is sometimes assumed that the location of an agent in a gridworld
does not change when the agent tries to move forward but bumps into an obstacle.
Although the absence of identity actions can potentially simplify the path planning
tasks, all of the big-O worst-case complexities stated in this report remain unchanged
if the state space has no identity actions. Thus, all of the state spaces that we use
as examples (except for the one in Figure 22) have no identity actions. However,
other properties of reinforcement learning domains can decrease the big-O worst-case
complexity of reaching a goal state with Q-learning. In the following, we will identify
such properties.


4.4.1  State  Spaces  with  no  Duplicate  Actions,  Constant  Individual  Upper 
Action  Bounds,  Linear  Total  Upper  Action  Bounds,  or  Polynomial 
Widths 

Consider  the  following  four  properties  of  state  spaces: 

Definition 6 A state space has no duplicate actions iff, for all s ∈ S and a, a' ∈
A(s), either a = a' or succ(s,a) ≠ succ(s,a'), i.e. no two actions that are applicable
in the same state have the same effect (see footnote 14).


Definition 7 A state space topology has a constant individual upper action
bound b ∈ R iff |A(s)| ≤ b for all s ∈ S and all n ∈ N, i.e. the number of actions
that are applicable in a state is bounded from above by a constant.

Definition 8 A state space topology has a linear total upper action bound b ∈ R
iff e ≤ bn for all n ∈ N, i.e. the total number of actions increases at most linearly in
the number of states.

14 If the agent knows that two or more actions achieve the same effect, it needs to consider only one
of them. However, according to our assumptions it does not know whether two actions have
the same effect.

Figure 15: A domain for which an admissible, zero-initialized Q-learning algorithm
(and every other uninformed on-line search algorithm) can need at least 1/6 n^3 - 1/6 n
steps to reach the goal state (for n > 1)


Note  that  every  state  space  topology  that  has  constant  individual  upper  action  bound 
b  also  has  linear  total  upper  action  bound  b. 

Definition 9 A state space topology has polynomial width [35] iff there exists a
polynomial function p such that n ≤ p(d) for all n ∈ N, i.e. the number of states is
bounded from above by a polynomial function of the depth of the state space.

The following inequalities allow us to express O(ed) in terms of n or d alone by providing
an upper bound on e that depends only on n or d: e ≤ n^2 for state spaces that have
no duplicate actions, and e ≤ bn for state spaces that have linear total upper action
bound b. To be more precise: O(e) = O(n) if the state space has linear total upper
action bound b, since then n ≤ e ≤ bn. Similarly, e ≤ p(d) (where p is a polynomial
function) for state spaces with polynomial width that either have no duplicate actions
or a linear total upper action bound.

If a state space has no duplicate actions, then O(ed) ≤ O(n^3), and the worst-case
complexity of an admissible Q-learning algorithm becomes O(n^3). This bound is tight
for a zero-initialized Q-learning algorithm, as shown in Figure 15. This demonstrates
that, although Q-learning performs undirected exploration, its worst-case complexity
for reaching a goal state is polynomial (and no longer exponential) in n if the
action-penalty representation is chosen instead of the goal-reward representation. Again,
every uninformed on-line search algorithm has at least the same big-O worst-case
complexity as zero-initialized Q-learning.

If a state space topology has linear total upper action bound b, then the worst-case
complexity becomes O(ed) ≤ O(bn^2) = O(n^2). If it has polynomial width and either no
duplicate actions or a linear total upper action bound, then e ≤ p(d) for a polynomial
function p and the complexity becomes O(ed) ≤ O(p'(d)), where p' is a polynomial
function. If the depth d of a state space topology is bounded from above by a constant
(i.e. the upper bound is independent of n), then O(ed) = O(e). If it also has no
duplicate actions, then the worst-case complexity decreases to O(ed) = O(e) ≤ O(n^2).

Figure 16: An example of a two-dimensional gridworld


If a state space topology has both a depth with a constant upper bound and a linear
total upper action bound, then O(ed) = O(e) = O(n).

Deterministic gridworlds with discrete states, in the following called gridworlds, which
have often been used in studying reinforcement learning, see for example [3], [28], [31],
[33], [37], or [24], have both a constant individual upper action bound and polynomial
width. In the state space shown in Figure 16, for example, the agent can move from
any square to each of its four neighboring squares as long as it stays on the grid and the
target square does not contain an obstacle (obstacles are shaded in the figure). Consider
now such a rectangular gridworld of size x × y, but without obstacles. It has n = xy
states, e = 4xy - 2x - 2y actions, and depth d = x + y - 2. It has no identity or duplicate
actions, a constant individual upper action bound of four, and polynomial width, since
n = xy ≤ (x + y)^2 = (d + 2)^2. It holds that O(ed) = O(x^2 y + x y^2). If the gridworld
is quadratic, i.e. x = y, then the depth increases sublinearly in n, since d = 2√n - 2,
and O(ed) = O(x^3) = O(n^(3/2)). Therefore, exploration in unknown gridworlds actually
has very low complexity.
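The counts for an obstacle-free gridworld can be verified by brute force. The sketch below is an illustration under stated assumptions: the helper names `gridworld`, `depth`, and `measure` and the encoding of a state space as a dictionary of successor lists are hypothetical, and the depth is taken to be the largest shortest-path distance between any ordered pair of states.

```python
from collections import deque

def gridworld(x, y):
    """Successor lists of an obstacle-free x-by-y gridworld (hypothetical encoding)."""
    succ = {}
    for i in range(x):
        for j in range(y):
            neighbors = [(i + 1, j), (i - 1, j), (i, j + 1), (i, j - 1)]
            succ[(i, j)] = [(a, b) for a, b in neighbors
                            if 0 <= a < x and 0 <= b < y]
    return succ

def depth(succ):
    """Largest shortest-path distance between any ordered pair of states."""
    worst = 0
    for source in succ:
        dist = {source: 0}
        queue = deque([source])
        while queue:                      # breadth-first search from `source`
            s = queue.popleft()
            for t in succ[s]:
                if t not in dist:
                    dist[t] = dist[s] + 1
                    queue.append(t)
        worst = max(worst, max(dist.values()))
    return worst

def measure(x, y):
    """Return (n, e, d) for the x-by-y gridworld."""
    succ = gridworld(x, y)
    n = len(succ)
    e = sum(len(outs) for outs in succ.values())
    return n, e, depth(succ)

print(measure(3, 5))  # (15, 44, 6): n = xy, e = 4xy - 2x - 2y, d = x + y - 2
```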

4.4.2  State  Spaces  that  are  1-Step  Invertible  or  Eulerian 

Gridworlds  often  have  another  special  property. 

Definition 10 A state space is 1-step invertible [35] iff it has no duplicate
actions and, for all s ∈ S and a ∈ A(s), there exists an a' ∈ A(succ(s,a)) such that
succ(succ(s,a),a') = s, i.e. the effect of an action can be reversed immediately.

Figure 17: A domain for which a random walk needs 1/8 n^3 + 1/8 n^2 - 5/8 n + 3/8
steps on average to reach the goal state (for odd n > 1) and for which an admissible,
zero-initialized Q-learning algorithm can need at least 1/16 n^3 + 3/8 n^2 - 3/16 n - 1/4
steps to reach the goal state (for n > 1 with n mod 4 = 1)


Definition 11 A state space is Eulerian iff |A(s)| = |{(s', a') : succ(s',a') = s ∧ s' ∈
S ∧ a' ∈ A(s')}| for all s ∈ S, i.e. there are as many actions that enter a state as there
are actions that leave the state.

Gridworlds are usually 1-step invertible, and every 1-step invertible state space is
Eulerian.
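The properties defined so far are easy to test for an explicitly given state space. The following sketch is an illustration only: the function names and the dictionary-of-successor-lists encoding are assumptions, and the three small state spaces are made-up examples. Every 1-step invertible state space is Eulerian because reversing actions pairs each action that leaves a state with exactly one action that enters it.

```python
from collections import Counter

def has_no_identity_actions(succ):
    return all(t != s for s, outs in succ.items() for t in outs)

def has_no_duplicate_actions(succ):
    return all(len(set(outs)) == len(outs) for outs in succ.values())

def is_one_step_invertible(succ):
    # No duplicate actions, and every action s -> t can be undone by some t -> s.
    return has_no_duplicate_actions(succ) and all(
        s in succ[t] for s, outs in succ.items() for t in outs)

def is_eulerian(succ):
    # As many actions enter a state as leave it.
    entering = Counter(t for outs in succ.values() for t in outs)
    return all(len(outs) == entering[s] for s, outs in succ.items())

# A 2-by-2 gridworld without obstacles (states numbered row by row).
grid = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
# A directed cycle: Eulerian, but no action can be reversed in one step.
cycle = {0: [1], 1: [2], 2: [0]}
# Two actions with the same effect in state 0: neither property holds.
doubled = {0: [1, 1], 1: [0]}

for name, space in [("grid", grid), ("cycle", cycle), ("doubled", doubled)]:
    print(name, is_one_step_invertible(space), is_eulerian(space))
```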

We do not assume that the agent knows that the state space is 1-step invertible or
Eulerian. Even a zero-initialized Q-learning algorithm with goal-reward representation
(i.e. a random walk) is tractable for Eulerian state spaces, as the following theorem
states.

Theorem 4 A zero-initialized Q-learning algorithm with goal-reward representation
reaches a goal state and terminates after at most O(ed) steps on average if the state
space is Eulerian.

This theorem is an immediate corollary to [1]. O(ed) ≤ O(en), since d ≤ n - 1 in every
strongly connected state space. Thus, a zero-initialized Q-learning algorithm with
goal-reward representation reaches a goal state and terminates after at most O(en) steps on
average if the state space is Eulerian. If the state space also has no duplicate actions,
then the largest average-case complexity becomes O(n^3). Figure 17 shows that this
bound is tight.

As an example of a gridworld, consider the one-dimensional gridworld in Figure 18.
The average number of steps that a random walk needs to reach the goal state of this
domain is a standard result in Operations Research for a symmetric random walk in
one dimension with one reflecting and one absorbing barrier [7]. In this case, generating
functions can be used to solve Equations 4.

Figure 18: A domain (“one-dimensional gridworld”) for which every search algorithm
needs at least n - 1 steps to reach the goal state, and a random walk needs n^2 - 2n + 1
steps on average to reach the goal state (for n > 1)

Figure 19: A domain for which every uninformed on-line search algorithm can need at
least e + n - 4 steps to reach the goal state (for n > 2 and e > 2n - 2) and for which
an admissible, zero-initialized Q-learning algorithm can need at least 1/2 en - 1/4 n^2 -
1/2 n + 2 steps to reach the goal state (for even n > 2 and e > 2n - 2)
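For small instances, the average number of steps of a random walk can also be computed exactly by solving the linear system h(s) = 1 + (1/|A(s)|) Σ_a h(succ(s,a)) for non-goal states, with h(s) = 0 for goal states. The sketch below is an illustration, not the generating-function method mentioned above: the helpers `expected_steps` and `chain` are hypothetical, and the one-dimensional gridworld is assumed to be a chain of n states with the start state at one end and the goal state at the other.

```python
from fractions import Fraction

def expected_steps(succ, goals):
    """Expected number of steps of a uniform random walk to reach a goal state."""
    states = [s for s in succ if s not in goals]
    idx = {s: i for i, s in enumerate(states)}
    n = len(states)
    # Augmented matrix of h(s) - (1/|A(s)|) * sum over a of h(succ(s,a)) = 1.
    m = [[Fraction(0)] * (n + 1) for _ in range(n)]
    for s in states:
        i = idx[s]
        m[i][i] += 1
        m[i][n] = Fraction(1)
        for t in succ[s]:
            if t not in goals:
                m[i][idx[t]] -= Fraction(1, len(succ[s]))
    # Gauss-Jordan elimination with exact rational arithmetic.
    for c in range(n):
        p = next(r for r in range(c, n) if m[r][c] != 0)
        m[c], m[p] = m[p], m[c]
        pivot = m[c][c]
        m[c] = [v / pivot for v in m[c]]
        for r in range(n):
            if r != c and m[r][c] != 0:
                f = m[r][c]
                m[r] = [vr - f * vc for vr, vc in zip(m[r], m[c])]
    return {s: m[idx[s]][n] for s in states}

def chain(n):
    """One-dimensional gridworld: states 0..n-1, goal state n-1 (assumed layout)."""
    succ = {i: [j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)}
    return succ, {n - 1}

succ, goals = chain(6)
print(expected_steps(succ, goals)[0])  # 25, i.e. n^2 - 2n + 1 for n = 6
```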

Although the complexity of a random walk decreases in 1-step invertible state spaces,
the worst-case complexity of an admissible, zero-initialized Q-learning algorithm with
action-penalty representation does not change. It remains tight at O(en) in general
(an example is given in Figure 19) and O(n^3) for state spaces that have no duplicate
actions (an example is shown in Figure 17).

To summarize, the largest average-case complexity of a random walk is polynomial (and
no longer exponential) in n if the state space is Eulerian and has no duplicate actions.
In fact, the largest big-O average-case complexity of a random walk equals the big-O
worst-case complexity of an admissible, zero-initialized Q-learning algorithm, and
we can no longer expect an exponential improvement in expected performance when
switching from a zero-initialized Q-learning algorithm with goal-reward representation
to one of the other algorithms.

One can seriously consider random walks as a solution method for reaching a goal state
in Eulerian state spaces, since they can no longer be much less efficient than the other
algorithms. Note two small advantages of random walks: To perform a random walk,
one does not need any space to store information, whereas an undiscounted,
zero-initialized Q-learning algorithm with (at most) action-penalty representation needs
O(e) Q-values with O(log_2(d)) bits each. Also, a random walk reaches a goal state in
an infinite one- or two-dimensional (but not higher-dimensional) gridworld with
probability one [7], whereas the other algorithms cannot guarantee to terminate successfully
(without modifications).

However, even for state spaces for which their big-O complexities are identical (such
as the one in Figure 17) we expect some improvement in performance, e.g. a linear
improvement (i.e. improvement by a constant factor), when switching from goal-reward
representation to action-penalty representation, since information about the topology
of the state space is remembered immediately only in the latter case.

If the state space is 1-step invertible and the agent knows for every action which other
action reverses its effects, then it can use chronological backtracking to reach a goal
state. Under the assumption that it has to execute an action at least once before it
knows its effect, chronological backtracking achieves a worst-case complexity of O(e),
since every action (except for the action that leaves the goal state) is executed exactly
once in the worst case. In fact, even if the agent initially does not know which action
inverts which other action, there is an algorithm for Eulerian state spaces that executes
every action at most twice and thus achieves the same big-O worst-case complexity
[6]: “Take unexplored edges whenever possible. If stuck, consider the closed walk of
unexplored edges just completed, and retrace it, stopping at nodes that have unexplored
edges, and applying this algorithm recursively from each such node.” We call this
algorithm the Deng-Papadimitriou algorithm after its authors. No uninformed
on-line search algorithm can do better than O(e), see for example Figure 19, since the
agent has to execute every action at least once in the worst case to find out about
its effect. However, Q-learning does not achieve this complexity. Figure 17 gives an
example of a 1-step invertible state space that has O(n^2) actions, but in which Q-learning
may need O(n^3) steps to reach a goal state. Thus, if one considers only Eulerian state
spaces, it no longer holds that every uninformed on-line search algorithm has at least the
same big-O worst-case complexity as Q-learning. This demonstrates that algorithms that
know about special properties of the state space, i.e. that use more knowledge of the
state space than Q-learning, can have a smaller big-O worst-case complexity.
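Chronological backtracking with known inverse actions can be sketched as a depth-first exploration. The code below is one possible reading of that strategy, not the authors' implementation: the helper names, the dictionary-of-successor-lists encoding, and the bookkeeping in `used` and `trail` are assumptions, and the sketch additionally assumes a strongly connected, 1-step invertible state space without identity actions.

```python
def gridworld(x, y):
    """Successor lists of an obstacle-free x-by-y gridworld (hypothetical encoding)."""
    succ = {}
    for i in range(x):
        for j in range(y):
            neighbors = [(i + 1, j), (i - 1, j), (i, j + 1), (i, j - 1)]
            succ[(i, j)] = [(a, b) for a, b in neighbors
                            if 0 <= a < x and 0 <= b < y]
    return succ

def chronological_backtracking(succ, goals, start):
    """Explore depth-first, undoing actions with their known inverses.

    Returns the goal state reached and the number of action executions.
    """
    used = {s: set() for s in succ}   # actions executed or reserved for undoing
    visited = {start}
    trail = []                        # states to return to when backtracking
    s, steps = start, 0
    while s not in goals:
        fresh = [a for a in range(len(succ[s])) if a not in used[s]]
        if fresh:
            a = fresh[0]
            used[s].add(a)
            t = succ[s][a]            # execute the unexplored action a
            steps += 1
            # The agent is assumed to know which action in t reverses a.
            used[t].add(succ[t].index(s))
            if t in visited:
                steps += 1            # nothing new: undo immediately, stay in s
            else:
                visited.add(t)
                trail.append(s)
                s = t
        else:
            s = trail.pop()           # backtrack with the reserved inverse action
            steps += 1
    return s, steps

succ = gridworld(4, 4)
e = sum(len(outs) for outs in succ.values())
state, steps = chronological_backtracking(succ, {(3, 3)}, (0, 0))
print(state, steps, e)
```

Because an action and its inverse are always marked together, no action is executed more than once, so the number of steps cannot exceed e.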

4.4.3  Summary 

Many  reinforcement  learning  domains  have  certain  properties  that  decrease  the  worst- 
case  complexity  of  uninformed  on-line  search  algorithms.  We  have  focused  on  grid- 
worlds,  because  they  are  popular  among  reinforcement  learning  researchers.  However, 
many  other  search  domains  that  are  commonly  used  in  Artificial  Intelligence  have  the 
properties  discussed  here.  The  8-puzzle,  for  example,  has  no  identity  or  duplicate 
actions,  a  constant  individual  upper  action  bound,  and  is  1-step  invertible. 

For Eulerian state spaces, we have shown that the largest big-O average-case complexity
of random walks equals the big-O worst-case complexity of admissible, zero-initialized
Q-learning algorithms. However, there are uninformed on-line search algorithms that
have a smaller big-O worst-case complexity than Q-learning. This is very different from
the general case. There, the largest big-O average-case complexity of random walks
is much larger than the big-O worst-case complexity of admissible, zero-initialized
Q-learning algorithms, and every uninformed on-line search algorithm has at least
the same big-O worst-case complexity as Q-learning. Thus, general results about the
behavior of reinforcement learning algorithms might not be specific enough. It can be
more illuminating to identify specific properties of the domains of interest and then
analyze the behavior of reinforcement learning algorithms in domains that possess these
properties.


5  Finding  Optimal  Policies  with  Q-Learning 

We now consider the problem of finding shortest paths from all states to a goal state.
We present novel extensions of the Q-learning algorithm to solve this problem that
have the same big-O worst-case complexity as an admissible, zero-initialized Q-learning
algorithm for finding a single arbitrary path from the start state to a goal state. The
purpose of these algorithms, which we call bi-directional Q-learning, is to explore
the state space sufficiently to solve the task and then terminate; their purpose is not
to exploit the acquired knowledge. First, we describe a version that needs to know an
upper bound on the depth of the state space in advance and then proceed to describe
a slightly more complicated version that does not need to know any information about
the state space in advance.

5.1  Bi-Directional  Q-Learning  (Version  1) 

The  algorithm,  which  we  term  the  bi-directional  (1-step)  Q-learning  algorithm 
(version  1),  is  presented  in  Figure  20.  It  assumes  only  that  the  agent  can  recognize 
whether  it  is  in  a  goal  state  and  knows  an  upper  bound  ub(d)  of  d.  It  could  for 
example  know  n  or  an  upper  bound  of  n.  In  the  ideal  case,  it  knows  the  exact  value  of 
d.  While  the  complexity  results  presented  here  are  for  the  undiscounted,  zero-initialized 
version  with  action-penalty  representation,  they  can  easily  be  transferred  to  all  of  the 
previously  described  alternatives. 

The bi-directional Q-learning algorithm uses the Q_f-values (U_f-values) to implement
an on-line search for a goal state in exactly the same way as the Q-learning algorithm
from Figure 3 uses the Q-values (U-values).

Initially, Q_f(s,a) = Q_b(s,a) = 0 and done(s,a) = false for all s ∈ S and a ∈ A(s).
/* Also, no steps have been taken so far, i.e. t = 0. */

1. Set s := the current state.

2. If s ∈ G, then set done(s,a) := true for all a ∈ A(s).

3. If done(s) = true, then go to 8.

4. /* forward step */
   Set a := argmax_{a' ∈ A(s)} Q_f(s,a').

5. Execute action a.
   /* As a consequence, the agent receives reward -1 and is in state succ(s,a).
   Increment the number of steps taken, i.e. set t := t + 1. */

6. Set Q_f(s,a) := -1 + U_f(succ(s,a)) and done(s,a) := done(succ(s,a)).

7. Go to 1.

8. /* backward step */
   Set a := argmax_{a' ∈ A(s)} Q_b(s,a').

9. Execute action a.
   /* As a consequence, the agent receives reward -1 and is in state succ(s,a).
   Increment the number of steps taken, i.e. set t := t + 1. */

10. Set Q_b(s,a) := -1 + U_b(succ(s,a)).

11. If U_b(s) < -ub(d), then stop.

12. Go to 1.

where
U_f(s) := max_{a ∈ A(s)} Q_f(s,a),
done(s) := ∃ a ∈ A(s) (Q_f(s,a) = max_{a' ∈ A(s)} Q_f(s,a') ∧ done(s,a)), and
U_b(s) := max_{a ∈ A(s)} Q_b(s,a)
at every point in time.

Figure 20: The bi-directional Q-learning algorithm (version 1)

We say that state s ∈ S is done iff U_f(s) = U_opt(s) = -gd(s), and that action
a ∈ A(s) is done in s ∈ S iff Q_f(s,a) = Q_opt(s,a), i.e. Q_f(s,a) = 0 for s ∈ G
and Q_f(s,a) = -1 - gd(succ(s,a)) for s ∈ S \ G. For every s ∈ S, we introduce a
value done(s), and, for every s ∈ S and a ∈ A(s), we introduce a value done(s,a),
with the following semantics: done(s) = true iff the agent knows that s ∈ S is done,
and similarly for done(s,a). Thus, if done(s) = true, then U_f(s) = -gd(s) (but not
necessarily the other way around) and done(s) remains true until termination.

Initially, done(s) = false for all s ∈ S, but the agent can gain additional knowledge
according to the following three rules:

1. If s ∈ G and Q_f(s,a) = 0, then a ∈ A(s) is done in s ∈ S.

2. If done(succ(s,a)) = true and Q_f(s,a) = -1 + U_f(succ(s,a)), then a ∈ A(s) is
done in s ∈ S \ G.

3. If done(s,a) = true and U_f(s) = Q_f(s,a) for an a ∈ A(s), then s ∈ S is done
(since the Q_f-values are monotonically decreasing).


If the agent was in a state s for which done(s) was false, executed action a, and
afterwards is in a state s' for which done(s') is true, then done(s,a) was false before
the step, but can be set to true afterwards. We call such a transition an important
transition. After the agent has made at most e important transitions, all n states are
done. If the agent is in a state s with done(s) = false and uses Q-learning to reach a
goal state, then it will finally make an important transition, since it sets done(s') to true
in s' ∈ G if it was not already set to true. Immediately after the important transition,
the agent can use another Q-learning algorithm with different Q_b-values to reach a
state in G_b := {s ∈ S \ G : done(s) = false}, and will eventually reach such a state if
one still exists, in which case it uses the Q_f-values again to reach a goal state. If such
a state no longer exists, the Q_b-values will decrease without a limit. The algorithm
can terminate when a U_b-value drops below -ub(d), since there cannot be shortest
paths to any state that are longer than d. Thus, the agent knows that done(s) = true
for all s ∈ S, can conclude that U_f(s) = -gd(s) for all s ∈ S, and may terminate.
(This termination criterion is peculiar to version 1 of the bi-directional Q-learning
algorithm, and will be replaced in version 2.) Then, an optimal policy is to select
action argmax_{a ∈ A(s)} U_f(succ(s,a)) or, equivalently, argmax_{a ∈ A(s), done(s,a) = true} Q_f(s,a) in
state s ∈ S \ G, where Q_f(s,a) and U_f(s) are the Q_f-values and U_f-values upon
termination, since it is optimal to always execute an action that decreases the goal
distance most (i.e., in this case, by one).

To summarize, the bi-directional Q-learning algorithm iterates over two independent
Q-learning searches: a forward phase that uses Q_f-values to search for a state s'
with done(s') = true from a state s with done(s) = false, followed by a backward
phase that uses Q_b-values to search for a state s' with done(s') = false from a state
s with done(s) = true. The forward and backward phases are implemented using the
Q-learning algorithm from Figure 3. Every forward phase sets at least one additional
done(s,a) value to true and then transfers control to the backward phase, which
continues until a state s with done(s) = false is reached, so that the next forward phase
can begin.
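The steps of Figure 20 translate almost line by line into code. The following is a minimal sketch under stated assumptions, not a reference implementation: the function and variable names, the dictionary-of-successor-lists encoding, the step budget `max_steps`, and the rule that ties go to the lowest-numbered action are all hypothetical choices made for illustration.

```python
def gridworld(x, y):
    """Successor lists of an obstacle-free x-by-y gridworld (hypothetical encoding)."""
    succ = {}
    for i in range(x):
        for j in range(y):
            neighbors = [(i + 1, j), (i - 1, j), (i, j + 1), (i, j - 1)]
            succ[(i, j)] = [(a, b) for a, b in neighbors
                            if 0 <= a < x and 0 <= b < y]
    return succ

def bidirectional_q_learning_v1(succ, goals, start, ub_d, max_steps=10**6):
    """Sketch of Figure 20; returns the final Q_f-values and the step count t."""
    q_f = {s: [0] * len(outs) for s, outs in succ.items()}
    q_b = {s: [0] * len(outs) for s, outs in succ.items()}
    done_sa = {s: [False] * len(outs) for s, outs in succ.items()}

    def done(s):
        # done(s): some action with maximal Q_f-value is known to be done.
        best = max(q_f[s])
        return any(q == best and d for q, d in zip(q_f[s], done_sa[s]))

    def argmax(values):
        return max(range(len(values)), key=values.__getitem__)  # first maximum

    s, t = start, 0
    while t < max_steps:
        if s in goals:                          # step 2
            done_sa[s] = [True] * len(succ[s])
        if not done(s):                         # steps 4-7: forward step
            a = argmax(q_f[s])
            nxt = succ[s][a]
            t += 1
            q_f[s][a] = -1 + max(q_f[nxt])
            done_sa[s][a] = done(nxt)
        else:                                   # steps 8-12: backward step
            a = argmax(q_b[s])
            nxt = succ[s][a]
            t += 1
            q_b[s][a] = -1 + max(q_b[nxt])
            if max(q_b[s]) < -ub_d:             # step 11: termination test
                return q_f, t
        s = nxt
    raise RuntimeError("step budget exhausted")

# Three states in a row, goal at the right end, depth d = 2.
succ = gridworld(3, 1)
q_f, t = bidirectional_q_learning_v1(succ, {(2, 0)}, (0, 0), ub_d=2)
print({s: max(q) for s, q in q_f.items()}, t)
```

On this three-state example the final U_f-values are the negated goal distances, and an optimal policy can be read off by moving to a successor with the largest U_f-value.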

Theorem 5 The bi-directional Q-learning algorithm (version 1) finds an optimal
policy and terminates after at most O(e × ub(d)) steps.

The proof of Theorem 5 is similar to that of Theorem 2. The bi-directional Q-learning
algorithm can be made more efficient, for example by breaking ties more intelligently,
but this does not change its big-O worst-case complexity.

The agent has to make sure that the value of ub(d) that it uses does not underestimate
d. If ub(d) scales linearly with the correct value of d, then O(e × ub(d)) = O(ed). This
rules out that the agent uses, for example, ub(d) > d^2, just to be conservative. Since
d ≤ n - 1 in every strongly connected state space, it holds that O(ed) ≤ O(en). If
the agent uses n - 1 as an upper bound of d, the bi-directional Q-learning algorithm
(version 1) has worst-case complexity O(en), but not necessarily O(ed), since d can
increase sublinearly in n. If the state space has no duplicate actions, the complexity
becomes O(n^3). That these bounds are tight follows from Figures 11 and 15, since
finding optimal policies cannot be easier than reaching a goal state for the first time.
So, if ub(d) scales linearly with d (e.g. because the agent uses ub(d) = d), then the
bi-directional Q-learning algorithm (version 1) has exactly the same big-O worst-case
complexity as the Q-learning algorithm for finding any path from the start state to a
goal state. This is surprising, since one could have expected finding optimal policies
to be harder for Q-learning than simply reaching a goal state.

5.2  Bi-Directional  Q-Learning  (Version  2) 

The  bi-directional  Q-learning  algorithm  that  we  have  presented  in  the  previous  chapter 
needs  to  know  an  upper  bound  ub(d)  on  the  depth  of  the  state  space.  ub(d)  is  only 
needed  to  provide  a  termination  criterion.  If  the  algorithm  only  has  to  find  an  optimal 
policy  in  the  limit,  but  does  not  need  to  terminate,  ub(d)  is  not  needed.  Of  course, 
then  the  agent  always  explores  and  never  exploits. 

It is easy to construct an algorithm that finds an optimal policy and terminates but
does not need to know an upper bound on the depth of the state space or any other
information about the state space in advance if we do not require the algorithm to be
memoryless. Note that the worst-case complexity of any algorithm that finds optimal
policies cannot be less than O(ed) (no matter how the algorithm determines that an
optimal policy has been found), since finding optimal policies cannot be easier than
reaching a goal state for the first time and Figure 11 showed that this has a worst-case
complexity of at least O(ed).

In the following, we show a version of the bi-directional Q-learning algorithm that does
not require the agent to know anything about the state space in advance, needs an
internal memory of at most O(log_2 e) bits, and has a tight worst-case complexity of
O(ed). To distinguish the version described earlier from the one introduced here, we
continue to call the original version “bi-directional Q-learning algorithm (version 1)”
(or, shorter, “bi-directional Q-learning algorithm”) and refer to the new version as
“bi-directional Q-learning algorithm (version 2)”. Version 2 not only requires no initial
knowledge of the state space, but also remedies the problem that the complexity might
be large only because an overly conservative estimate of d is used.

The bi-directional Q-learning algorithm (version 2) is shown in Figure 21. It
differs from version 1 (shown in Figure 20) only in the termination criterion. To
implement the termination criterion, version 2 maintains one additional variable that
version 1 does not use. The variable memory implements an internal memory that
the agent maintains across action executions. memory stores integer values that range
from 0 to at most e. Thus, the internal memory needs at most ⌈log_2(e + 1)⌉ bits, which
is not enough to store a map of the state space.

The  general  idea  behind  the  termination  criterion  is  as  follows.  The  agent  can  terminate 
if  done(s)  =  true  for  all  s  £  S,  since  then  an  optimal  policy  has  been  found.  Initially, 
done(s)  =  false  for  all  s  £  S.  Thus,  done(s)  —  true  for  all  s  €  S  if  done(s)  =  true 
for  every  state  s  that  the  agent  has  already  explored,  i.e.  visited  at  least  once,  and 
the  agent  has  explored  all  states.  Every  action  that  the  agent  has  not  yet  explored, 
i.e.  executed  at  least  once,  could  potentially  lead  to  an  unexplored  state.  Thus,  the 
agent  can  be  sure  that  it  has  explored  all  states  iff  it  has  explored  all  actions.  This 
reasoning  leads  to  the  following  idea.  The  agent  maintains  in  its  internal  memory  the 
sum  of  the  number  of  unexplored  actions  that  it  knows  about  and  the  number  of  states 
s  with  done(s)  =  false  that  it  knows  about.  It  can  terminate  when  this  sum  reaches 
zero, since then done(s) = true for all explored states $s \in S$ and all states have been
explored, i.e. done(s) = true for all $s \in S$.

This  idea  could  be  implemented  exactly  as  stated.  However,  a  couple  of  additional 
observations simplify the implementation. For example, note that all actions in a
non-goal state s have been explored if done(s) = true. Therefore, the algorithm does not
need  to  keep  track  of  unexplored  actions  in  non-goal  states.  Similarly,  done(s)  =  true 
for  a  goal  state  s  if  all  actions  in  s  have  been  explored.  Thus,  the  algorithm  does  not 
need  to  keep  track  of  the  number  of  goal  states  s  with  done(s)  =  false.  In  short,  the 
agent needs to maintain in its internal memory (i.e. the variable memory) only the
sum  of  the  number  of  unexplored  actions  in  goal  states  that  it  knows  about  and  the 
number  of  non-goal  states  s  with  done(s)  =  false  that  it  knows  about.  Again,  the 
agent  knows  that  it  can  terminate  when  this  sum  reaches  zero. 

This  termination  criterion  can  easily  be  implemented.  Initially,  memory  =  0. 

• Whenever the agent is in a state s that was still unexplored one step ago, then
this state has not been taken into account when calculating the sum, and the
sum has to be adjusted properly. If s is a goal state, the sum has to be increased
by |A(s)|, since all of its |A(s)| actions are still unexplored. If s is not a goal
state, memory has to be increased by one, since done(s) was false initially and
remains false for at least as long as no action has been executed in s.

• Whenever done(s) changes for a non-goal state s, the sum has to be adjusted
properly. done(s) can only change for the current state, it can only change from
false to true, and it can only change after an action execution in the forward
step. Furthermore, a forward step is only executed for non-goal states. Thus,
when done(s) changes for the current state s after action execution in a forward
step, memory has to be decreased by one.

• Whenever an action in a goal state becomes explored, the sum has to be adjusted
properly. Actions in goal states can only be executed in backward steps. Therefore,
when an action in a goal state becomes explored after an action execution
in a backward step, memory has to be decreased by one.

Initially, memory = 0, $Q_f(s,a) = Q_b(s,a) = 0$, and done(s,a) = false for all $s \in S$
and $a \in A(s)$. /* Also, no steps have been taken so far, i.e. t = 0. */

1. Set s := the current state.

2. If $s \in G$, then set done(s,a) := true for all $a \in A(s)$.

3. If $Q_f(s,a) = Q_b(s,a) = 0$ for all $a \in A(s)$,
then set memory := memory + (if $s \in G$ then |A(s)| else 1).

4. If memory = 0, then stop.

5. If done(s) = true, then go to 11.

6. /* forward step */
Set $a := \arg\max_{a' \in A(s)} Q_f(s,a')$.

7. Execute action a.
/* As a consequence, the agent receives reward -1 and is in state succ(s,a).
Increment the number of steps taken, i.e. set t := t + 1. */

8. Set $Q_f(s,a) := -1 + U_f(succ(s,a))$ and done(s,a) := done(succ(s,a)).

9. If done(s) has changed during the execution of step 8,
then set memory := memory - 1.

10. Go to 1.

11. /* backward step */
Set $a := \arg\max_{a' \in A(s)} Q_b(s,a')$.

12. Execute action a.
/* As a consequence, the agent receives reward -1 and is in state succ(s,a).
Increment the number of steps taken, i.e. set t := t + 1. */

13. Set $Q_b(s,a) := -1 + U_b(succ(s,a))$.

14. If $s \in G$ and $Q_b(s,a)$ has changed from zero to non-zero during the execution of
step 13, then set memory := memory - 1.

15. Go to 1.

where

$U_f(s) := \max_{a \in A(s)} Q_f(s,a)$,

done(s) := $\exists a \in A(s) \, (Q_f(s,a) = \max_{a' \in A(s)} Q_f(s,a') \wedge done(s,a))$, and

$U_b(s) := \max_{a \in A(s)} Q_b(s,a)$

at every point in time.

Figure 21: The bi-directional Q-learning algorithm (version 2)
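The bookkeeping described above can be sketched in Python as follows. This is a minimal sketch that follows the steps of Figure 21, not a reference implementation. The encoding of the domain as a dictionary `succ` (state to list of successor states, with the list index playing the role of the action), the function name, the `max_steps` guard, and the tie-breaking rule (lowest action index first) are all assumptions made for illustration.

```python
from collections import defaultdict

def bidirectional_q_learning_v2(succ, goals, start, max_steps=100_000):
    """Sketch of Figure 21. succ[s][a] is the successor of action a in state s."""
    qf = defaultdict(int)        # Qf(s,a); 0 means "never executed in a forward step"
    qb = defaultdict(int)        # Qb(s,a); 0 means "never executed in a backward step"
    done_sa = defaultdict(bool)  # done(s,a)
    memory, t, s = 0, 0, start

    def actions(x):
        return range(len(succ[x]))

    def uf(x):
        return max(qf[x, a] for a in actions(x))

    def ub(x):
        return max(qb[x, a] for a in actions(x))

    def done(x):
        return any(qf[x, a] == uf(x) and done_sa[x, a] for a in actions(x))

    while t < max_steps:  # hypothetical safety guard, not part of Figure 21
        if s in goals:                                           # step 2
            for a in actions(s):
                done_sa[s, a] = True
        if all(qf[s, a] == 0 == qb[s, a] for a in actions(s)):   # step 3
            memory += len(succ[s]) if s in goals else 1
        if memory == 0:                                          # step 4
            return {x: uf(x) for x in succ}, t
        if not done(s):                                          # forward step
            a = max(actions(s), key=lambda i: qf[s, i])
            nxt = succ[s][a]
            qf[s, a] = -1 + uf(nxt)
            done_sa[s, a] = done(nxt)
            if done(s):                                          # step 9
                memory -= 1
        else:                                                    # backward step
            a = max(actions(s), key=lambda i: qb[s, i])
            nxt = succ[s][a]
            was_unexplored = qb[s, a] == 0
            qb[s, a] = -1 + ub(nxt)
            if s in goals and was_unexplored:                    # step 14
                memory -= 1
        s, t = nxt, t + 1
    raise RuntimeError("step limit exceeded")
```

For a three-state one-dimensional gridworld, `bidirectional_q_learning_v2({0: [1], 1: [0, 2], 2: [1]}, {2}, 0)` returns Uf-values that equal the negated goal distances, without the agent ever being told n, e, or d.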


Thus, the agent needs to be able to detect whether its current state was unexplored
one step ago and whether an action in a goal state is unexplored or not. In general, an
action $a \in A(s)$ has never been executed in a forward step iff $Q_f(s,a) = 0$. Similarly,
it has never been executed in a backward step iff $Q_b(s,a) = 0$. Thus, the current state
s of the agent was unexplored one step ago iff the agent has never executed any action
in s, i.e. $Q_b(s,a) = Q_f(s,a) = 0$ for all $a \in A(s)$. An action $a \in A(s)$ in a goal state
s is unexplored iff $Q_b(s,a) = 0$, since actions in goal states can only be executed in
backward steps and therefore it always holds that $Q_f(s,a) = 0$.

These thoughts translate directly to the bi-directional Q-learning algorithm (version 2).
Since version 1 and version 2 of the bi-directional Q-learning algorithm
differ only in the termination criterion, both behave identically after every step
and therefore find the same optimal policy (if ties are broken in the same way).
Thus, it is still optimal to select action $\arg\max_{a \in A(s)} U_f(succ(s,a))$ or, equivalently,
$\arg\max_{a \in A(s),\, done(s,a) = true} Q_f(s,a)$ in state $s \in S \setminus G$, where $Q_f(s,a)$ and $U_f(s)$ are the
$Q_f$-values and $U_f$-values upon termination.

Theorem 6 The bi-directional Q-learning algorithm (version 2) finds an optimal policy
and terminates after at most O(ed) steps.


To  prove  this  theorem,  one  shows  that  the  termination  criterion  of  the  bi-directional 
Q-learning  algorithm  (version  2)  is  implied  by  the  one  of  the  bi-directional  Q-learning 
algorithm  (version  1).  This  shows  that  version  2  always  terminates  no  later  than 
version  1  no  matter  what  ub(d)  is  (provided  that  ties  are  broken  in  the  same  way).15 
If ub(d) = d, then the worst-case complexity of the bi-directional Q-learning algorithm
(version 1) is O(ed). Therefore, the worst-case complexity of version 2 is O(ed) as well.
If the state space has no duplicate actions, the complexity becomes $O(n^3)$. That these
bounds  are  tight  follows  from  Figures  11  and  15,  since  finding  optimal  policies  cannot 
be  easier  than  reaching  a  goal  state  for  the  first  time. 


5.3 Comparison of Bi-Directional Q-Learning to other On-Line Search Algorithms

To find optimal policies, the agent has to execute every action at least once in order to
find out about its effect. If the state space is Eulerian, then the Deng-Papadimitriou
algorithm can be used to learn a map which is then used for finding optimal policies.
It executes every action at most twice, i.e. it has complexity O(e) and is competitive.
No uninformed on-line search algorithm can do better. However, the bi-directional
Q-learning algorithms do not achieve this complexity, as shown in Figure 17. This shows
again that on-line search algorithms that know about special properties of the state
space can have an advantage over less informed algorithms. First worst-case results for
uninformed on-line search algorithms in non-Eulerian state spaces are reported in [6].

15 To be precise: The bi-directional Q-learning algorithm (version 2) checks the termination criterion
a couple of lines later than version 1. Thus, version 2 always terminates at most a couple of program
statements later than version 1.

6  Complexity  of  Value-Iteration 

The results derived for Q-learning can easily be transferred to value-iteration. In the
following, we state the main definitions, lemmas, and theorems as they apply to
value-iteration.

Definition 12 A value-iteration algorithm is initialized with $q \in \mathbb{R}$ (or, synonymously,
q-initialized) iff initially, for all $s \in G$, U(s) = 0, and, for all $s \in S \setminus G$,
U(s) = q.

6.1  Reaching  a  Goal  State  with  Value-Iteration 

A  zero-initialized  value-iteration  algorithm  with  goal-reward  representation  performs 
a  random  walk.  Therefore,  Theorem  7  follows  immediately  from  Theorem  1,  and 
Theorem  8  follows  from  Theorem  4. 

Theorem 7 The expected number of steps that a zero-initialized value-iteration algorithm
with goal-reward representation needs in order to reach a goal state and terminate
can be exponential in n.

Theorem 8 A zero-initialized value-iteration algorithm with goal-reward representation
reaches a goal state and terminates after at most O(ed) steps on average if the
state space is Eulerian.

An undiscounted value-iteration algorithm that always executes the action that leads
to the state with the largest U-value is equivalent to the Learning Real-Time A*
(LRTA*) algorithm [15] [16] [17] with a search horizon of one if the state space is
deterministic and the action penalty representation is used [2]. ([2] also showed that
the LRTA* algorithm can be generalized to probabilistic domains.) Ishida and Korf
proved that a zero-initialized LRTA* algorithm is guaranteed to reach a goal state
in at most $O(n^2)$ steps if the state space has no identity actions. This follows from
the analysis of the Moving Target Search (MTS) algorithm in [10] if the position of
the target does not change; see [14] for a detailed discussion and bibliography. As
argued earlier, the agent does not need to execute identity actions, since they cannot
be part of an optimal plan. However, the value-iteration algorithm treats an identity
action exactly in the same way that it treats every other action (although it could
be augmented to identify and avoid identity actions). Thus, the presence of identity
actions can potentially make the path planning task harder.

The following theorems about value-iteration generalize Ishida and Korf’s result to
state spaces that can contain identity actions and show the effects of various domain
properties and initial U-values that are non-zero on the worst-case complexity.
Definitions 13, 14, and 15, as well as Lemma 3, assume no discounting, but can easily be
modified for the discounted case.

Definition 13 U-values are consistent iff, for all $s \in G$, U(s) = 0, and, for all
$s \in S \setminus G$, $\max_{a \in A(s)} (-1 + U(succ(s,a))) \le U(s) \le 0$.

Definition 14 U-values are admissible iff, for all $s \in S$, $-gd(s) \le U(s) \le 0$.

Zero-initialized U-values are consistent, and consistent U-values are admissible.
U-values are admissible iff -U is an admissible heuristic for the goal distances of the
states in the context of A*-search.
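Definitions 13 and 14 can be checked mechanically for a small deterministic domain. The following Python sketch is only an illustration: the dictionary encoding `succ` (state to list of successor states) and the helper names `goal_distances`, `is_consistent`, and `is_admissible` are assumptions, and the goal distances gd(s) are computed by a breadth-first search over reversed edges.

```python
from collections import deque

def goal_distances(succ, goals):
    """Breadth-first search over reversed edges; returns gd(s) for every state."""
    pred = {s: [] for s in succ}
    for s, successors in succ.items():
        for nxt in successors:
            pred[nxt].append(s)
    gd = {g: 0 for g in goals}
    queue = deque(goals)
    while queue:
        s = queue.popleft()
        for p in pred[s]:
            if p not in gd:
                gd[p] = gd[s] + 1
                queue.append(p)
    return gd

def is_consistent(U, succ, goals):
    """Definition 13: U(s) = 0 on goals, max_a(-1 + U(succ(s,a))) <= U(s) <= 0 elsewhere."""
    return all(U[s] == 0 for s in goals) and all(
        max(-1 + U[nxt] for nxt in succ[s]) <= U[s] <= 0
        for s in succ if s not in goals
    )

def is_admissible(U, succ, goals):
    """Definition 14: -gd(s) <= U(s) <= 0 for all states."""
    gd = goal_distances(succ, goals)
    return all(-gd[s] <= U[s] <= 0 for s in succ)
```

In a three-state one-dimensional gridworld with goal state 2, the values `{0: -2, 1: 0, 2: 0}` are admissible but not consistent, which illustrates that the implication between the two properties holds in one direction only.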

Definition 15 A value-iteration algorithm is admissible iff it uses action-penalty
representation, its action selection step is
“$a := \arg\max_{a' \in A(s)} (r(s,a') + U(succ(s,a'))) = \arg\max_{a' \in A(s)} (-1 + U(succ(s,a')))$,”16
and either

• its initial U-values are consistent, and its value update step17 is
“Set $U(s) := -1 + U(succ(s,a))$,” or

• its initial U-values are admissible, and its value update step is
“Set $U(s) := \min(U(s), -1 + U(succ(s,a)))$.”

If a value-iteration algorithm is admissible, then consistent (admissible) U-values remain
consistent (admissible) after every step of the agent and are monotonically decreasing.
A time superscript of t in the following lemma refers to the values of the
variables immediately before the action execution step, i.e. line 4, of step t.

Lemma 3 For all steps $t \in \mathbb{N}_0$ (until termination) of an undiscounted, admissible
value-iteration algorithm,

$$U^t(s^t) + \sum_{s \in S} U^0(s) - t \ge \sum_{s \in S} U^t(s) + U^0(s^0) - loop^t$$

and

$$loop^t \le \sum_{s \in S} U^0(s) - \sum_{s \in S} U^t(s),$$

where $loop^t := |\{t' \in \{0, \ldots, t-1\} : s^{t'} = s^{t'+1}\}|$ (the number of identity actions
executed before t).

16 Since $r(s,a') = -1$ for all $s \in S \setminus G$ and $a' \in A(s)$, the action selection step can be simplified to
“$a := \arg\max_{a' \in A(s)} U(succ(s,a'))$.”

17 Note that the action selection step “$a := \arg\max_{a' \in A(s)} (-1 + U(succ(s,a')))$” allows us to simplify
the value update step from the general “Set $U(s) := \max_{a' \in A(s)} (r(s,a') + \gamma U(succ(s,a')))$,” that
we used in the statement of the value-iteration algorithm in Figure 4, to the more specific
“Set $U(s) := r(s,a) + \gamma U(succ(s,a)) = -1 + \gamma U(succ(s,a)) = -1 + U(succ(s,a))$.”

Figure 22: A domain for which an admissible, zero-initialized value-iteration algorithm
can need at least $n^2 - n$ steps to reach the goal state, but an off-line algorithm that
knows the topology of the state space needs only one step to reach the goal state (for
n > 1)

All  results  about  undiscounted  value-iteration  algorithms  can  be  transferred  to  the 
discounted  case,  as  outlined  for  Q-learning. 

Lemma 4 An admissible value-iteration algorithm reaches a goal state and terminates
after at most $2 \sum_{s \in S} gd(s) \le n^2 - n$ steps. If the state space has no identity actions, it
reaches a goal state and terminates after at most $\sum_{s \in S} gd(s) \le 0.5n^2 - 0.5n$ steps.

Both bounds are tight for zero-initialized value-iteration, as shown in Figures 22 and
23. (It is important to keep in mind that the agent can only detect whether its current
state is a goal state. Therefore, it cannot detect that state n is a goal state when it is
in state 1.) Note that the worst-case complexity doubles if identity actions are present.
However, the big-O worst-case complexity is the same in both cases.
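To make the bound of Lemma 4 concrete, the following Python sketch runs a zero-initialized value-iteration algorithm with action-penalty representation until it reaches the goal and compares the step count with the sum of the goal distances. The dictionary encoding of the domain, the helper names, and the tie-breaking rule (first maximal successor in list order) are assumptions made for this illustration only.

```python
from collections import deque

def line_world(n):
    """One-dimensional gridworld with states 0..n-1 (left/right moves, no identity actions)."""
    return {s: [x for x in (s - 1, s + 1) if 0 <= x < n] for s in range(n)}

def goal_distances(succ, goal):
    """gd(s) for every state, via breadth-first search over reversed edges."""
    pred = {s: [] for s in succ}
    for s, successors in succ.items():
        for nxt in successors:
            pred[nxt].append(s)
    gd, queue = {goal: 0}, deque([goal])
    while queue:
        s = queue.popleft()
        for p in pred[s]:
            if p not in gd:
                gd[p] = gd[s] + 1
                queue.append(p)
    return gd

def value_iteration_to_goal(succ, goal, start):
    """Admissible value-iteration: a := argmax U(succ(s,a')), then U(s) := -1 + U(succ(s,a))."""
    U = {s: 0 for s in succ}  # zero-initialized U-values
    s, steps = start, 0
    while s != goal:
        nxt = max(succ[s], key=lambda x: U[x])  # ties: first successor in list order
        U[s] = -1 + U[nxt]
        s, steps = nxt, steps + 1
    return steps, U
```

For `line_world(5)` with start state 0 and goal state 4, this sketch needs 4 steps, which is below the bound of 10 given by the sum of the goal distances for a state space without identity actions.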

Theorem 9 An admissible value-iteration algorithm reaches a goal state and terminates
after at most O(nd) steps.

This theorem applies to admissible value-iteration algorithms that are zero-initialized.
Note that the worst-case complexity does not depend on e, whereas the analogous
complexity for Q-learning does, see Theorem 2. This is so, because value-iteration
always prefers duplicate actions equally, since duplicate actions have the same outcome
and actions are evaluated according to the U-values of their outcomes. Thus,
value-iteration does not necessarily need to execute every action at least once. Q-learning, in
contrast, evaluates actions according to their Q-values, and the Q-values of duplicate
actions can differ.

$O(nd) \le O(n^2)$, since $d \le n - 1$. The worst-case complexity of $O(n^2)$ is tight even
if the state space has no duplicate actions, a constant individual upper action bound,
and is 1-step invertible, see the rectangular gridworld in Figure 24.

Figure 23: A domain for which an admissible, zero-initialized value-iteration algorithm
(and every other algorithm that has to enter a state at least once before it knows the
successor states) can need at least $\frac{1}{2}n^2 - \frac{1}{2}n$ steps to reach the goal state, but an
off-line algorithm that knows the topology of the state space needs only one step to
reach the goal state (for n > 1)

Figure 24: A domain for which an admissible, zero-initialized value-iteration algorithm
can need at least $\frac{3}{16}n^2 - \frac{3}{4}$ steps to reach the goal state (for n > 1 with n mod 4 = 2)

Every algorithm that has to enter a state at least once before it knows the successor
states of that state has a worst-case complexity of at least $O(n^2)$, see Figure 23. This
holds even if the state space has no duplicate actions and a constant individual upper
action bound, see Figure 25, and thus these algorithms are not competitive. This
includes algorithms that learn a map of the state space. However, if the state space
is 1-step invertible, then the agent can identify which action inverts the action that
it executed last and thus use chronological backtracking to reach a goal state. This
decreases the worst-case complexity to O(n), since the agent leaves every state at most
once. No algorithm can do better in the worst case, see for example Figure 18. However,
value-iteration does not achieve this complexity, as shown above with Figure 24.

Figure 25: A domain for which an admissible, zero-initialized value-iteration algorithm
(and every other algorithm that has to enter a state at least once before it knows the
successor states) can need at least $\frac{1}{4}n^2 - 1$ steps to reach the goal state (for even
n > 1)

Similarly to Q-learning, it holds that a discounted, one-initialized value-iteration
algorithm with goal-reward representation behaves exactly like an undiscounted, (minus
one)-initialized value-iteration algorithm with action-penalty representation if ties are
broken in the same way.


Theorem 10 A discounted, one-initialized value-iteration algorithm with goal-reward
representation reaches a goal state and terminates after at most O(nd) steps if its action
selection step is “$a := \arg\max_{a' \in A(s)} (r(s,a') + U(succ(s,a')))$” and its value update
step is “Set $U(s) := r(s,a) + \gamma U(succ(s,a))$.”


All of the above complexity results can be utilized to prove worst-case complexities
for Q-learning, since value-iteration behaves like Q-learning in a slightly modified state
space. Assume that Q-learning operates in a domain with states S, start state $s_{start}$,
goal states G, actions A(s) for $s \in S$, successor function succ, and initial Q-values
Q(s,a) for $s \in S$ and $a \in A(s)$. We refer to this domain in the following as the original
domain. Consider the following transformed domain:

$$
\begin{aligned}
\bar{S} &:= \{ s_{s,a} : s \in S, a \in A(s) \} \\
\bar{s}_{start} &:= s_{s_{start},\, \arg\max_{a \in A(s_{start})} Q(s_{start},a)} \\
\bar{G} &:= \{ s_{s,a} : s \in G, a \in A(s) \} \\
\bar{A}(\bar{s}) &:= A(succ(s,a)) \quad \text{for all } \bar{s} = s_{s,a} \in \bar{S} \\
\overline{succ}(\bar{s}, \bar{a}) &:= s_{succ(s,a),\, \bar{a}} \quad \text{for all } \bar{s} = s_{s,a} \in \bar{S} \text{ and } \bar{a} \in \bar{A}(\bar{s})
\end{aligned}
$$

where the $s_{s,a}$ are new states. Basically, the actions of the original domain become the
states of the transformed one. The potential successor states of such an “action
state” $s_{s,a}$ are the “action states” that correspond to the actions that can be executed
immediately after the execution of a in s, i.e. the set A(succ(s,a)). Note that the
transformed domain satisfies all properties that we require in this report to hold for
state spaces. In particular, it is strongly connected. Note also that the size $\bar{n}$ of the
transformed domain equals the total number of actions e of the original domain.
Similarly, the depth $\bar{d}$ of the transformed domain is equal to either the depth d
of the original domain or to d + 1. An example for an original domain and the
transformed domain that corresponds to it is given in Figure 26.

Figure 26: A domain and its transformation (panel (a): original domain)
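The construction of the transformed domain can be sketched in a few lines of Python. This is an illustration under stated assumptions: the original domain is encoded as a dictionary `succ` (state to list of successor states, list index = action), an "action state" is represented as the pair `(s, a)`, the initial Q-values are passed as a dictionary `q0` with missing entries treated as zero, and ties in the choice of the start action are broken by the lowest action index. None of these encoding choices come from the report.

```python
def transform(succ, goals, start, q0):
    """Build the transformed domain whose states are the actions of the original one."""
    # One "action state" (s, a) per action a in A(s) of the original domain.
    t_states = [(s, a) for s in succ for a in range(len(succ[s]))]
    # Successors of (s, a): all action states of the state reached by executing a in s.
    t_succ = {
        (s, a): [(succ[s][a], b) for b in range(len(succ[succ[s][a]]))]
        for (s, a) in t_states
    }
    # Goal action states are the actions available in goal states.
    t_goals = {(s, a) for (s, a) in t_states if s in goals}
    # Start action state: the action that Q-learning would select first.
    a0 = max(range(len(succ[start])), key=lambda a: q0.get((start, a), 0))
    return t_succ, t_goals, (start, a0)
```

For the three-state one-dimensional gridworld `{0: [1], 1: [0, 2], 2: [1]}` with goal state 2, the transformed domain has four states, one per action of the original domain, in line with the observation that its size equals e.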

Assume that action penalty representation and no discounting are used. Admissible
value-iteration behaves in the transformed domain identically to admissible Q-learning
in the original one provided that

• the original domain has no identity actions,

• initially $Q(s,a) = U(s_{s,a})$ for all $s \in S$ and $a \in A(s)$, and

• ties are broken in the same way.


During the execution of the algorithms it always holds that $Q(s,a) = U(s_{s,a})$ for all
$s \in S$ and $a \in A(s)$. Furthermore, value-iteration always chooses the same actions as
Q-learning, and needs the same number of steps to terminate. Note that the Q-values
of the original domain are consistent (admissible) iff the U-values of the transformed
domain are consistent (admissible).

The similarity in the behavior of value-iteration and Q-learning can be used to transfer
theorems about one algorithm to the other one. For example, Theorem 9 states
that value-iteration has a worst-case complexity of O(nd) = O(ed) in the transformed
domain. Thus, this is the worst-case complexity of Q-learning in the original domain
as well, which confirms Theorem 2. (As Theorem 2 shows, the conditions that guarantee
equal behavior of value-iteration and Q-learning can be generalized to discounted
algorithms and state spaces that contain identity actions.)


6.2  Finding  Optimal  Policies  with  Value-Iteration 

Korf  showed  that  the  LRTA*  algorithm  identifies  an  optimal  path  from  a  given  start 
state  to  a  set  of  goal  states  in  the  limit  if  the  agent  is  automatically  reset  to  the 
start  state  when  it  reaches  a  goal  state.  Our  bi-directional  (1-step)  value-iteration 
algorithm,  see  Figure  27,  is  more  general  than  Korf’s  algorithm  in  that  it  finds  shortest 
paths  from  all  states  to  a  goal  state,  does  not  need  to  have  reset  actions  available,  and 
terminates.  ub(d)  is  again  an  upper  bound  on  the  depth  of  the  state  space  d.18  See 
[14]  for  details  on  the  bi-directional  value-iteration  algorithm  (and  how  it  can  be  made 
more efficient, unfortunately without decreasing its big-O worst-case complexity).


Theorem 11 The bi-directional value-iteration algorithm finds an optimal policy and
terminates after at most $O(n \times ub(d))$ steps.


If ub(d) scales linearly with the correct value of d, then $O(n \times ub(d)) = O(nd) \le O(n^2)$.
That this bound is tight even if the state space has no duplicate actions, a constant
individual upper action bound, and is 1-step invertible, follows from Figure 24, since
finding optimal policies cannot be easier than reaching a goal state.

18 The bi-directional value-iteration algorithm corresponds to the bi-directional Q-learning algorithm
(version 1). As in the case of that Q-learning algorithm, the upper bound on the depth of the state
space ub(d) is needed only to provide a termination criterion for the bi-directional value-iteration
algorithm. If we do not require the agent to be memoryless, then it is easy to construct a version of
the bi-directional value-iteration algorithm that does not need to know an upper bound on the depth
of the state space or any other information about the state space in advance and has a tight worst-case
complexity of O(nd). This can be done along the lines outlined in Chapter 5.2 for the bi-directional
Q-learning algorithm (version 2).

Initially, $U_f(s) = U_b(s) = 0$ and done(s) = false for all $s \in S$.
/* Also, no steps have been taken so far, i.e. t = 0. */

1. Set s := the current state.

2. If $s \in G$, then set done(s) := true.

3. If done(s) = true, then go to 8.

4. /* forward step */
Set $a := \arg\max_{a' \in A(s)} U_f(succ(s,a'))$.

5. Execute action a.
/* As a consequence, the agent receives reward -1 and is in state succ(s,a). Set
t := t + 1. */

6. Set $U_f(s) := -1 + U_f(succ(s,a))$ and done(s) := done(succ(s,a)).

7. Go to 1.

8. /* backward step */
Set $a := \arg\max_{a' \in A(s)} U_b(succ(s,a'))$.

9. Execute action a.
/* As a consequence, the agent receives reward -1 and is in state succ(s,a). Set
t := t + 1. */

10. Set $U_b(s) := -1 + U_b(succ(s,a))$.

11. If $U_b(s) < -ub(d)$, then stop.

12. Go to 1.

Figure 27: The bi-directional value-iteration algorithm
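The steps of the bi-directional value-iteration algorithm in Figure 27 translate almost line by line into Python. The sketch below is illustrative only: the dictionary encoding `succ` of the domain, the parameter name `ub_d` for the upper bound ub(d), the `max_steps` guard, and the tie-breaking rule (first maximal successor in list order) are assumptions, not part of the figure.

```python
def bidirectional_value_iteration(succ, goals, start, ub_d, max_steps=100_000):
    """Sketch of Figure 27. succ[s] lists the successor states of s, one per action."""
    uf = {s: 0 for s in succ}        # forward U-values
    ub = {s: 0 for s in succ}        # backward U-values
    done = {s: False for s in succ}
    s, t = start, 0
    while t < max_steps:             # hypothetical safety guard, not part of Figure 27
        if s in goals:                                   # step 2
            done[s] = True
        if not done[s]:                                  # forward step (steps 4 to 6)
            nxt = max(succ[s], key=lambda x: uf[x])
            uf[s] = -1 + uf[nxt]
            done[s] = done[nxt]
            stop = False
        else:                                            # backward step (steps 8 to 11)
            nxt = max(succ[s], key=lambda x: ub[x])
            ub[s] = -1 + ub[nxt]
            stop = ub[s] < -ub_d
        s, t = nxt, t + 1
        if stop:
            return uf, t
    raise RuntimeError("step limit exceeded")
```

On the three-state one-dimensional gridworld `{0: [1], 1: [0, 2], 2: [1]}` with goal state 2 and ub(d) = d = 2, the returned forward values equal the negated goal distances of all states.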

Every algorithm that has to enter a state at least once before it knows the successor
states of that state has a worst-case complexity of at least $O(n^2)$ for finding an optimal
policy even if the state space has no duplicate actions and a constant individual upper
action bound, see for example Figure 25. However, if one considers only Eulerian state
spaces, then the Deng-Papadimitriou algorithm, which has a worst-case complexity of
only O(e), can be used to learn a map that is then used for finding an optimal policy.
Thus, this algorithm needs at most O(n) steps for finding optimal policies in Eulerian
state spaces with a linear total upper action bound, whereas the bi-directional
value-iteration algorithm can need $O(n^2)$ steps, as shown above with Figure 24. Similarly,
if one considers only 1-step invertible state spaces, chronological backtracking has a
complexity of O(n), but bi-directional value-iteration can still need $O(n^2)$ steps, as
shown with Figure 24 as well.


7  Empirical  Results 

Until  now,  we  have  been  concerned  with  the  worst-case  complexity  of  reinforcement 
learning  algorithms.  However,  for  practical  purposes  their  average-case  complexities 
are  equally  important.  In  this  chapter,  we  present  a  brief  case  study  of  the  behavior 
of  various  uninformed  reinforcement  learning  algorithms  in  three  simple  domains: 

• reset state spaces (see Figure 8) of sizes $n \in [2,50]$,

• one-dimensional gridworlds (see Figure 18) of sizes $n \in [2,50]$, and

• two-dimensional, quadratic gridworlds without obstacles (see Figure 16 for an
example with obstacles) of sizes $n \in [4,196]$ that have the start state in the
upper left-hand corner and the goal state in the lower right-hand corner.


The  one-dimensional  and  quadratic  gridworlds  are  1-step  invertible,  but  the  reset  state 
space  is  not.  The  depth  of  the  one-dimensional  gridworld  and  the  reset  state  space 
scales linearly with n, since d = n - 1 in both cases. For the quadratic gridworld,
however, d scales sublinearly with n, because $d = 2\sqrt{n} - 2$. All three domains have no
duplicate  actions,  a  constant  individual  upper  action  bound,  and  polynomial  width. 

The  run-times  of  the  reinforcement  learning  algorithms  (i.e.  the  number  of  steps 
needed)  are  shown  in  Figures  28,  29,  and  30,  respectively.  All  graphs  are  scaled  in 
the  same  proportion. 

The x-axis shows the complexity of the domain (measured as ed) and the y-axis the
run-time (measured as number of steps needed to complete the tasks). The expected
run-time of a random walk (i.e. of a zero-initialized Q-learning or value-iteration
algorithm with goal-reward representation) is determined analytically. The run-times
of the other reinforcement learning algorithms (we use zero-initialized algorithms with
action-penalty representation) are averaged over 5000 runs, with ties broken randomly.
For finding optimal policies, we use either the bi-directional value-iteration algorithm,
the bi-directional Q-learning algorithm (version 1), or its version 2. We label the last
case with “no ub(d)”, since the bi-directional Q-learning algorithm (version 2) does not
need to know an upper bound on the depth of the state space in advance.19 In the
first two cases, we use either ub(d) = d or ub(d) = n - 1 as upper bounds on the depth
of the state space. In all three cases, we distinguish two performance measures: the
number of steps until $U_f(s) = U_{opt}(s)$ for every $s \in S$ (i.e. until an optimal policy is
identified), and the number of steps until the algorithm realizes that and terminates.
For identifying an optimal policy, it does not matter which version of the bi-directional
Q-learning algorithm is used.

Figure 28: Run-times in the reset state space (as a function of ed = e(n - 1))

Note that value-iteration needs exactly n - 1 steps to reach the goal state in reset
state  spaces  and  one-dimensional  gridworlds,  the  best  number  of  steps  possible.  These 
graphs  are  thus  very  close  to  the  x-axis.  Also,  the  bi-directional  Q-learning  algorithm 
(version  2)  terminates  (almost)  immediately  after  an  optimal  policy  has  been  identified. 
Thus,  the  corresponding  graphs  are  very  close. 

Note that d and n - 1 are identical for the reset state space and the one-dimensional
gridworld, but they differ for the quadratic gridworld. Thus, ub(d) = d and ub(d) =
n - 1 collapse for the former two domains and, furthermore, it does not matter whether
we use ed or en as a measure for the complexity of the domain. For the latter domain,
however, d scales sublinearly with n. Thus, it does make a difference whether we use
ub(d) = d or ub(d) = n - 1 and whether we use ed or en as a measure for the domain
complexity. Figure 31 contains the same data as Figure 30, except that it uses en for
the domain complexity instead of ed.

19 We do not include the results about the corresponding version of the bi-directional value-iteration
algorithm, since we have not described it in this report.

Figure 29: Run-times in the one-dimensional gridworld (as a function of ed = e(n - 1))

In  the  following,  we  investigate  how  the  run-times  of  the  algorithms  scale  with  the 
complexity  of  the  domains.  We  have  shown  in  the  previous  chapters  that  the  worst- 
case  bounds  of  the  following  tasks  can  be  at  most  linear  in  ed: 

•  reaching  a  goal  state  with  random  walks  in  Eulerian  state  spaces  on  average, 

•  reaching  a  goal  state  with  (efficient)  Q-learning  (i.e.  using  either  zero-initialized 
Q-learning  with  action-penalty  representation  or  one-initialized  Q-learning  with 
goal-reward  representation), 

•  identifying  an  optimal  policy  with  (efficient)  Q-learning,  and 

•  identifying  an  optimal  policy  and  terminating  with  (efficient)  Q-learning  if  either 
version  1  is  used  and  ub(d)  scales  at  most  linearly  with  d  or  version  2  is  used. 


Since $d \le n - 1$, these bounds are also at most linear in en. If d scales sublinearly with
n, they are even sublinear in en.
Figure 30: Run-times in the quadratic gridworld (as a function of ed)


Remember that the worst-case complexity for reaching a goal state with efficient
Q-learning depends on $\sum_{s \in S \setminus G} \sum_{a \in A(s)} (gd(succ(s,a)) + 1)$ (see Lemma 2),
of which e(d + 1) and en are upper bounds. Table 1 contains these values for the three domains.


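domain                      $\sum_{s \in S \setminus G} \sum_{a \in A(s)} (gd(succ(s,a)) + 1)$    e(d + 1)                       en

reset state space           $1.5n^2 - 2.5n$                                                      $2n^2 - 2n$                    $2n^2 - 2n$
one-dimensional gridworld   $n^2 - 3$                                                            $2n^2 - 2n$                    $2n^2 - 2n$
quadratic gridworld         $4n^{3/2} - 4n - 4$                                                  $8n^{3/2} - 12n + 4n^{1/2}$    $4n^2 - 4n^{3/2}$

Table 1: Domain characteristics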
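
The entries of Table 1 can be checked numerically. The following is a minimal sketch, not the code used for the experiments in this report. It assumes particular layouts for the three domains (goal at the last state or in a corner, one action in state 1, and so on); the function names `reset_state_space`, `gridworld_1d`, `gridworld_2d`, `goal_distances` and `table_row` are hypothetical, and the largest goal distance is used in place of the depth d, which is an assumption that happens to hold for these layouts.

```python
from collections import deque
from math import isqrt

def reset_state_space(n):
    # Assumed layout: states 1..n with goal n; state 1 can only move forward,
    # states 2..n-1 can move forward or reset to state 1, the goal has one action.
    succ = {1: [2], n: [n]}
    for s in range(2, n):
        succ[s] = [s + 1, 1]
    return succ, {n}

def gridworld_1d(n):
    # Assumed layout: a line of states 1..n with the goal at the right end.
    succ = {1: [2], n: [n - 1]}
    for s in range(2, n):
        succ[s] = [s - 1, s + 1]
    return succ, {n}

def gridworld_2d(n):
    # Assumed layout: sqrt(n) x sqrt(n) grid, four-neighbor moves, goal in a corner.
    m = isqrt(n)
    succ = {}
    for i in range(m):
        for j in range(m):
            moves = [(i + 1, j), (i - 1, j), (i, j + 1), (i, j - 1)]
            succ[(i, j)] = [(a, b) for a, b in moves if 0 <= a < m and 0 <= b < m]
    return succ, {(0, 0)}

def goal_distances(succ, goals):
    # Breadth-first search backwards from the goal states.
    pred = {s: [] for s in succ}
    for s, targets in succ.items():
        for t in targets:
            pred[t].append(s)
    gd = {g: 0 for g in goals}
    queue = deque(goals)
    while queue:
        t = queue.popleft()
        for s in pred[t]:
            if s not in gd:
                gd[s] = gd[t] + 1
                queue.append(s)
    return gd

def table_row(succ, goals, n):
    gd = goal_distances(succ, goals)
    # Double sum over non-goal states s and their actions a of gd(succ(s,a)) + 1.
    total = sum(gd[t] + 1 for s in succ if s not in goals for t in succ[s])
    e = sum(len(targets) for targets in succ.values())
    d = max(gd.values())  # largest goal distance, used here in place of the depth d
    return total, e * (d + 1), e * n

n = 9
rows = {
    "reset state space": table_row(*reset_state_space(n), n),
    "one-dimensional gridworld": table_row(*gridworld_1d(n), n),
    "quadratic gridworld": table_row(*gridworld_2d(n), n),
}
for name, row in rows.items():
    print(name, row)
```

Under these assumptions the three returned numbers per domain can be compared directly with the closed forms in the table for any n that is a perfect square.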


Thus,

\[ O\Big(\sum_{s \in S \setminus G} \sum_{a \in A(s)} (gd(succ(s,a)) + 1)\Big) = O(ed) = O(en) \]

for reset state spaces and one-dimensional gridworlds, but

\[ O\Big(\sum_{s \in S \setminus G} \sum_{a \in A(s)} (gd(succ(s,a)) + 1)\Big) = O(ed) < O(en) \]

for quadratic gridworlds, since d is linearly proportional to n for the former two domains,
but sublinearly proportional for the latter one. In all three cases, ed is linearly
proportional to $\sum_{s \in S \setminus G} \sum_{a \in A(s)} (gd(succ(s,a)) + 1)$ and therefore a good measure for
Figure 31: Run-times in the quadratic gridworld (as a function of en)


the  worst-case  complexity  of  reaching  a  goal  state  with  efficient  Q-learning.  Although 
this  does  not  need  to  be  the  case,  it  seems  to  hold  for  almost  all  domains  of  practical 
interest. 

The  worst-case  bound  of  the  following  task  can  be  at  most  linear  in  en: 

•  identifying an optimal policy and terminating with (efficient) Q-learning if
version 1 is used and ub(d) scales at most linearly with n.

The worst-case bounds of the following tasks can be at most linear in nd:


•  reaching  a  goal  state  with  (efficient)  value-iteration, 

•  identifying  an  optimal  policy  with  (efficient)  value-iteration,  and 

•  identifying an optimal policy and terminating with (efficient) value-iteration if
ub(d) scales at most linearly with d.


Since $n \le e$, these bounds are also at most linear in ed. They can be exactly linear in
ed if the state space has a linear total upper action bound. The bounds are at most
linear in $n^2$, because $d \le n - 1$. If d scales sublinearly with n, they are sublinear in $n^2$.

The worst-case bound of the following task can be at most linear in $n^2$:

•  identifying an optimal policy and terminating with (efficient) value-iteration if
ub(d) scales at most linearly with n.


Since $n \le e$, the bound is also at most linear in en. It can be exactly linear in en if
the state space has a linear total upper action bound.

Although  the  worst-case  complexity  of  an  algorithm  provides  an  upper  bound  for  its 
average-case  complexity,  the  average-case  complexity  does  not  necessarily  scale  linearly 
with  the  worst-case  complexity.  Therefore,  it  is  interesting  to  investigate  in  which  of 
the  cases  not  only  the  worst-case  complexity,  but  also  the  average-case  complexity 
scales  linearly  with  the  complexity  of  the  domain. 

Table 2 shows which graphs of the average-case complexity deviate from the corresponding
graphs of the worst-case complexity. The first entry is always the shape of
the graph for the worst-case complexity (as stated above, i.e. taking into account only
the general properties of the state space such as being 1-step invertible etc.), the second
entry is the one of the graph for the average-case complexity. We call a graph
superlinear if it has a positive second derivative, linear if its second derivative is
zero, and sublinear if its second derivative is negative.20

This table demonstrates that, at least for the domains tested, the worst-case complexity
of the algorithms for finding optimal policies behaves similarly to their average-case
complexity. Note, however, that the coefficients for an algorithm (i.e. the slopes of the
graphs) are domain dependent.

Figure  32  shows  the  complexity  (measured  as  ed)  and  Figure  33  the  sizes  (measured  as 
n)  of  domains  for  which  the  algorithms  require  1000  steps  on  average.  We  use  a  simple 
linear  approximation  scheme  between  data  points  for  cardinal  n.  (Reaching  goal  states 
with  value-iteration  needs  so  few  steps  that  its  graph  is  far  above  the  other  graphs  and 
therefore  not  contained  in  the  figures.)  These  graphs  confirm  our  expectations  about 
the  algorithms: 

•  We  expect  the  run-time  for  reaching  a  goal  state  to  be  smaller  than  the  run-time 
for  finding  an  optimal  policy  which  we  expect,  in  turn,  to  be  smaller  than  the 
run-time  for  terminating  with  an  optimal  policy,  given  that  the  same  algorithm, 
task  representation,  and  initialization  is  used  in  all  cases:  Reaching  a  goal  state 
is  a  prerequisite  for  finding  an  optimal  policy,  which  is  necessary  for  terminating 
with  an  optimal  policy. 

•  We also expect a Q-learning algorithm with “no ub(d)” to terminate with an
optimal policy earlier than a Q-learning algorithm with $ub(d) = ub_1$, which we

20 The shapes of the graphs for the average-case complexity were determined by visual inspection
of the graphs only. Therefore, there is for example a chance that a non-linear shape appeared linear,
or that a sublinear or superlinear shape was the result of lower order terms. For example, the graph
$(x, f(x)) = (n^2, n^2 + n)$ for $n \in [1, 10]$ is sublinear, although $n^2 \le n^2 + n \le 2n^2$ and therefore
$O(n^2) = O(n^2 + n)$. Of course, for large values of the independent variable n, the graph approaches
a linear shape. Unfortunately, we cannot be sure that in our experiments n was large enough.

Table 2: Shapes of complexity graphs (by domain and domain complexity measure)


expect, in turn, to terminate earlier than a Q-learning algorithm with $ub(d) = ub_2$
for all $d < ub_1 < ub_2$. Similarly, we expect a value-iteration algorithm with
$ub(d) = ub_1$ to terminate earlier than a value-iteration algorithm with $ub(d) = ub_2$,
for all $d < ub_1 < ub_2$: We have proved that the bi-directional Q-learning
algorithm (version 2) always terminates with an optimal policy no later than any
bi-directional Q-learning algorithm (version 1), provided that ties are broken in
the same way. Also, Q-learning and value-iteration terminate with an optimal
policy the earlier, the smaller ub(d), since then the earlier a U-value drops below
$-ub(d)$, which terminates the algorithm.

•  Furthermore, we expect the run-time of the efficient value-iteration algorithm
to be smaller than the run-time of the efficient Q-learning algorithm which we
expect to be smaller than the run-time of a random walk, given the same task
to be solved and, for terminating with an optimal policy, the same ub(d) used:
Random walks do not remember the topology of the state space at all. Efficient
value-iteration and efficient Q-learning both remember the topology partially, but
value-iteration is more powerful, since it uses an action model.

•  In addition to these relationships, the graphs show that random walks are
inefficient in reset state spaces, but perform much better in one-dimensional and
Figure 32: Complexity of domains that can be solved in 1000 steps


quadratic  gridworlds,  since  the  latter  are  Eulerian  and  therefore  easier  domains 
for  random  walks  than  the  malicious  reset  state  spaces,  where  the  agent  has  to 
choose the correct action (out of two possible actions) n - 2 times in a row in
order  to  succeed.  But  even  for  gridworlds,  the  efficient  Q-learning  algorithms 
continue  to  perform  better  than  random  walks,  since  they  remember  information 
about  the  topology  of  the  state  space,  whereas  random  walks  do  not. 
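
The contrast between reset state spaces and gridworlds for random walks can be made concrete by computing expected hitting times exactly. The sketch below is only an illustration under stated assumptions: actions are chosen uniformly at random, the two domains use the hypothetical layouts built by `reset_state_space` and `gridworld_1d` (state 1 has a single forward action, the goal is state n), and `expected_steps` is a helper introduced here that solves the linear system E[s] = 1 + mean of E over the successors of s with exact rational arithmetic.

```python
from fractions import Fraction

def reset_state_space(n):
    # Assumed layout: state 1 moves forward; states 2..n-1 move forward or reset to 1.
    succ = {1: [2]}
    for s in range(2, n):
        succ[s] = [s + 1, 1]
    return succ

def gridworld_1d(n):
    # Assumed layout: a line of states 1..n; interior states move left or right.
    succ = {1: [2]}
    for s in range(2, n):
        succ[s] = [s - 1, s + 1]
    return succ

def expected_steps(succ, goal, start):
    # Expected number of steps of a uniform random walk from start to goal.
    states = [s for s in succ if s != goal]
    index = {s: k for k, s in enumerate(states)}
    size = len(states)
    # Augmented matrix of (I - P) E = 1, restricted to the non-goal states.
    rows = [[Fraction(0)] * (size + 1) for _ in range(size)]
    for s in states:
        r = rows[index[s]]
        r[index[s]] += 1
        r[size] = Fraction(1)
        for t in succ[s]:
            if t != goal:
                r[index[t]] -= Fraction(1, len(succ[s]))
    # Gauss-Jordan elimination.
    for col in range(size):
        pivot = next(k for k in range(col, size) if rows[k][col] != 0)
        rows[col], rows[pivot] = rows[pivot], rows[col]
        rows[col] = [x / rows[col][col] for x in rows[col]]
        for k in range(size):
            if k != col and rows[k][col] != 0:
                factor = rows[k][col]
                rows[k] = [x - factor * y for x, y in zip(rows[k], rows[col])]
    return rows[index[start]][size]

for n in (4, 10):
    print(n, expected_steps(reset_state_space(n), n, 1), expected_steps(gridworld_1d(n), n, 1))
```

With these layouts the expected number of steps grows exponentially with n in the reset state space and only quadratically in the one-dimensional gridworld.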


8  Extensions 

The results presented in this report can easily be extended to cases where the actions
do not have the same reward, or prior knowledge of the topology of the state space is
available.


•  Prior knowledge (in form of suitable initial Q-values, i.e. consistent or admissible
Q-values that are non-zero) makes the Q-learning algorithms better informed and
can decrease their run-times, as can easily be seen from Lemma 2. For example, in
the totally informed case, the Q-values are initialized as follows if action-penalty
representation and no discounting are used:

\[ Q(s,a) = \begin{cases} 0 & \text{if } s \in G \\ -1 - gd(succ(s,a)) & \text{otherwise} \end{cases} \quad \text{for all } s \in S \text{ and } a \in A(s). \]
Figure 33: Sizes of domains that can be solved in 1000 steps


Lemma 2 predicts in this case that the agent needs only at most $-U(s) = gd(s)$
steps to reach a goal state from a given $s \in S$, the best number of steps possible.

In many cases, admissible heuristics (for A*-search) are known for the goal distances.
As explained in Chapter 4.2.1, such heuristics allow one to initialize the
Q-values with non-zero values and thus make the Q-learning algorithms better
informed. Similarly, suitable initial U-values make the value-iteration algorithm
better informed. Usually, the initial knowledge will be in between the two extremes
of being totally uninformed and totally informed.


•  Under the action-penalty representation, every action has an immediate cost of
one. The results presented in this report can easily be adapted to the case where
actions do not have the same immediate rewards, but arbitrary strictly negative
immediate rewards. In this case, the complexity of reaching a goal state with
admissible Q-learning (with an adequately adapted definition of admissibility)
becomes $O(\frac{r_{min}}{r_{max}} ed)$ instead of O(ed), where $r_{min}$ is the smallest immediate reward
(i.e. the one with the largest absolute value) and $r_{max}$ is the largest immediate
reward. If one defines the weighted distance $d_w(s,s')$ between $s \in S$ and $s' \in S$
to be the (unique) solution of the following set of equations

\[ d_w(s,s') = \begin{cases} 0 & \text{if } s = s' \\ \min_{a \in A(s)} (-r(s,a) + d_w(succ(s,a), s')) & \text{otherwise} \end{cases} \quad \text{for all } s, s' \in S, \]

and the weighted depth of the state space to be $d_w := \max_{s,s' \in S} d_w(s,s')$,
then one can prove the even tighter bound $O(\frac{1}{|r_{max}|} e d_w)$. (The proofs are analogous
to the proofs of Lemma 1, Lemma 2, and Theorem 2.) Similarly, the complexity
of reaching a goal state with admissible value-iteration becomes $O(\frac{r_{min}}{r_{max}} nd)$
or, alternatively, $O(\frac{1}{|r_{max}|} n d_w)$ instead of O(nd).
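
The weighted distances from the second item above can be computed by repeated relaxation of their defining equations. The following sketch is one possible way to do so and is not taken from this report: the function name `weighted_distances`, the dictionary-based encoding of `succ` and `reward`, and the three-state example domain are all assumptions made for illustration.

```python
def weighted_distances(succ, reward, target):
    # Solve d_w(s, target) = 0 if s == target, and otherwise
    # min over a of (-r(s,a) + d_w(succ(s,a), target)),
    # by Bellman-Ford style relaxation. Rewards must be strictly negative.
    inf = float("inf")
    dw = {s: inf for s in succ}
    dw[target] = 0
    for _ in range(len(succ)):  # n - 1 rounds suffice; one extra round is harmless
        for s in succ:
            if s == target:
                continue
            for a, t in succ[s].items():
                dw[s] = min(dw[s], -reward[(s, a)] + dw[t])
    return dw

# Hypothetical three-state domain: succ[s][a] is the successor, reward[(s, a)] < 0.
succ = {"A": {"x": "B", "y": "C"}, "B": {"x": "C", "y": "A"}, "C": {"x": "A"}}
reward = {("A", "x"): -1, ("A", "y"): -5, ("B", "x"): -1, ("B", "y"): -2, ("C", "x"): -1}

table = {target: weighted_distances(succ, reward, target) for target in succ}
weighted_depth = max(table[target][s] for target in succ for s in succ)
print(table["C"]["A"], weighted_depth)
```

In this example the expensive direct action from A to C (cost 5) is bypassed through B, so the weighted depth stays well below the product of the largest cost and the unweighted depth. This is the situation in which the bound in terms of $d_w$ is tighter.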


9  Further  Problems 

Reinforcement  learning  algorithms  can  not  only  be  used  in  deterministic  state  spaces, 
but  also  in  state  spaces  with  probabilistic  action  outcomes.  The  assumption  usually 
made  is  that  the  relative  complexities  of  different  algorithms  in  deterministic  domains 
predict  their  performance  in  probabilistic  domains.  However,  three  problems  arise  in 
the  latter  case  that  must  be  dealt  with: 


•  In  probabilistic  domains,  reinforcement  learning  algorithms  do  not  only  have  to 
learn  the  potential  outcomes  of  actions,  but  the  transition  probabilities  as  well, 
either  explicitly  (as  done  by  value-iteration)  or  implicitly  (as  done  by  Q-learning). 
For Q-learning, the problem arises that the learning rate α can no longer be set
to one, and one has to execute an action a in a state s repeatedly in order for
Q(s,a) to converge even if the U-values of all potential outcomes of a in s are
already correct. This also implies that it can take a long time for Q-learning to
decrease large initial Q-values to their final values, since the Q-values are only
adjusted  in  small  increments.  Similarly,  in  case  the  transition  probabilities  are 
represented  explicitly,  a  number  of  samples  larger  than  one  is  needed  to  estimate 
each  probability  reliably.  [11]  and  [32]  propose  heuristics  that  address  these 
issues. 

•  In probabilistic domains, admissible Q-values do not necessarily remain admissible.
This is due to the fact that the transition probabilities are unknown and only
estimates are available. These estimates can deviate from the correct values a lot,
although their difference approaches zero with probability one when the sample
size goes towards infinity (provided that a reasonable estimation method is used).
As a consequence, it can happen that $U(s) < U_{opt}(s)$ for a state s even if the
U-values of the potential outcomes of all actions in s remain admissible. For example,
consider undiscounted, zero-initialized value-iteration with action-penalty
representation and assume that the transition frequencies are used to estimate
the transition probabilities (i.e. maximum-likelihood estimation is used). Further
assume that there is a state s with A(s) = {a}, executing a in s results in
state $s_1$ with probability 0.5 and in state $s_2$ with the complementary probability,
and $U_{opt}(s_1) < U_{opt}(s_2)$. If $U(s_1) = U_{opt}(s_1)$, $U(s_2) = U_{opt}(s_2)$, and executing a
in s has resulted $n_1$ times in the successor state $s_1$ and $n_2$ times in the successor
state $s_2$ where $n_1 > n_2$, then

\[ U(s) = -1 + \frac{n_1}{n_1 + n_2} U(s_1) + \frac{n_2}{n_1 + n_2} U(s_2) < -1 + 0.5\, U_{opt}(s_1) + 0.5\, U_{opt}(s_2) = U_{opt}(s) \]

after updating s. If $U(s) < U_{opt}(s)$, then s looks worse than it is, which can
make the agent avoid executing actions that have s as an outcome. [11] calls this
phenomenon “sticking”.

•  In  probabilistic  domains,  optimal  policies  can  have  cycles  (i.e.  the  agent  visits 
the  same  state  more  than  once  with  non-zero  probability).  This  is  another  reason 
why it can take Q-learning (or value-iteration) a long time to decrease large initial
Q-values (or U-values). It also implies that the average number of steps that are
required  to  reach  a  goal  state  when  executing  the  optimal  policy  can  already  be 
exponential  in  n  [33].  In  this  case,  exploration  is  clearly  exponential  and  one  has 
to  factor  out  the  inherent  complexity  of  the  state  space  from  the  complexity  of 
the  learning  algorithm. 
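
The “sticking” effect from the second item of the list above can be reproduced with a few numbers. This is a minimal sketch under assumed values: the function `ml_backup`, the successor values -4 and -2, and the sample counts 3 and 1 are chosen here for illustration and do not come from the experiments.

```python
def ml_backup(counts, u, cost=1.0):
    # Value-iteration backup for a state with a single action under
    # action-penalty representation, with transition frequencies used
    # as maximum-likelihood estimates of the transition probabilities.
    total = sum(counts.values())
    return -cost + sum(c / total * u[t] for t, c in counts.items())

# Assumed optimal values of the two successors, with U_opt(s1) < U_opt(s2).
u_opt = {"s1": -4.0, "s2": -2.0}

# True value of s: both outcomes have probability 0.5.
true_value = -1.0 + 0.5 * u_opt["s1"] + 0.5 * u_opt["s2"]

# Estimated value after observing s1 three times and s2 once (n1 > n2).
estimate = ml_backup({"s1": 3, "s2": 1}, u_opt)

print(true_value, estimate)
```

The estimate lies below the true value although both successor values are exact, so the state looks worse than it is until further samples correct the frequencies.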


To  summarize,  more  research  is  required  to  transfer  the  results  from  deterministic  to 
probabilistic  state  spaces.  This  is  part  of  our  current  research  activities.  However,  the 
results  reported  here  have  already  proved  useful  for  an  analysis  of  various  extensions 
of  reinforcement  learning  techniques.  For  example,  [18]  utilizes  them  to  analyze  the 
complexity  of  hierarchical  Q-learning. 


10  Conclusion 


Many real-world domains have the characteristic of the task presented here: the agent
must reach one of a number of goal states by taking actions, but the initial topology of
the  state  space  is  unknown.  Prior  results  which  indicated  that  reinforcement  learning 
algorithms  performed  random  walks  until  they  reach  a  goal  state  for  the  first  time 
and  therefore  were  exponential  in  n,  the  size  of  the  state  space,  seemed  to  limit  their 
usefulness  for  such  tasks. 

This  report  has  shown,  however,  that  such  algorithms  are  tractable  when  using  either  an 
appropriate  task  representation  (the  action-penalty  representation)  or  suitable  initial 
Q-values.  Both  changes  produce  a  dense  reward  structure,  which  facilitates  learning. 
In particular, we showed that Q-learning needs at most O(ed) steps to find a path
from the start state to a goal state, or at most $O(n^3)$ steps if the state space has
no duplicate actions. These bounds are tight. Moreover, every uninformed on-line
search algorithm has the same big-O worst-case complexity. We showed, however, that
learning (and subsequently using) a map of the state space can decrease the big-O
worst-case complexity for some (but not all) domains: additional planning between
action executions can reduce the number of action executions by more than a constant
factor if one is willing to tolerate an increase in deliberation time between action
executions. But the complexity results also show that even if one knew the topology of
the state space in advance and performed Q-learning anyway, one would only increase
the number of action executions by a factor of O(en) at most.

We have shown that both initial knowledge of the topology of the domain (in form
of initial Q-values that are admissible, but non-zero) and domain properties such as
having no identity actions, no duplicate actions, a constant individual upper action
bound, a linear total upper action bound, polynomial width, being 1-step invertible or
Eulerian can decrease the complexity of the Q-learning algorithm (even if the agent
does not know whether a domain has these properties). Many reinforcement learning
domains, for example gridworlds, share some or all of these properties. Therefore,
exploration in these domains actually has very low complexity. For instance, the
worst-case complexity of reaching a goal state with Q-learning in quadratic gridworlds is only
$O(n^{3/2})$.

In general, the largest big-O average-case complexity of a random walk is much larger
than the big-O worst-case complexity of Q-learning, and every uninformed on-line
search algorithm has at least the same big-O worst-case complexity as Q-learning.
For Eulerian state spaces, however, we have shown that the largest big-O average-case
complexity of a random walk equals the big-O worst-case complexity of Q-learning.
We continue to expect Q-learning to outperform a random walk (although the
improvement can no longer be exponential) since random walks have no memory of past
experiences. However, there exist uninformed on-line search algorithms that have a
smaller big-O worst-case complexity than Q-learning. They demonstrate that actively
utilizing properties of the state space can decrease the complexity. These results are
very different from the general case. This shows that general results about the
complexity of reinforcement learning algorithms might not be specific enough. It can be
more  interesting  to  identify  specific  properties  of  the  reinforcement  learning  domain  of 
interest  and  investigate  how  they  influence  the  complexity. 

We  have  introduced  the  novel  bi-directional  Q-learning  algorithm  for  finding  shortest 
paths  from  all  states  to  a  goal  state  and  have  shown,  somewhat  surprisingly,  that 
its  complexity  is  O(ed)  as  well.  This  provides  an  efficient  algorithm  to  learn  optimal 
policies. 

All results that we have derived in this report for Q-learning can easily be transferred to
value-iteration, which uses an action model and can therefore be expected to be more
efficient than Q-learning. Undiscounted, admissible value-iteration with action-penalty
representation is equivalent to Korf’s LRTA* algorithm with a search horizon of one.
Our results generalize Korf’s results to state spaces that contain identity actions and
also show the effects of various domain properties and initial U-values that are non-zero
on the complexity. We have shown that the behavior of Q-learning in any domain is
identical to the behavior of value-iteration in an adequately modified domain. Our
bi-directional value-iteration algorithm generalizes the LRTA* algorithm in that it finds
shortest paths from all states to a goal state, does not need to have reset actions
available, and terminates.

Tight bounds on the largest average number of steps required for reaching a goal state
using a zero-initialized algorithm with goal-reward representation (the same results
apply to finding optimal policies)

State Space                        Q-Learning     Value-Iteration

general case                       exponential    exponential
no duplicate actions               exponential    exponential
linear total upper action bound    exponential    exponential

Tight bounds on the number of steps required in the worst case for reaching a goal state
using a zero-initialized algorithm with action-penalty representation or a one-initialized
algorithm with goal-reward representation (the same results apply to finding optimal
policies)

Figure 34: Complexities of Reinforcement Learning


The important results (expressed in terms of n and e) are summarized in Figure 34.
They demonstrate that undirected exploration methods can be tractable. This result
is supported by our empirical studies in three different reinforcement learning domains,
which also show that ed is a good measure for the complexity of a state space.

While some reinforcement learning tasks cannot be reformulated as shortest path problems
(for example, non-goal-oriented tasks where the agent has to learn how to behave
in the world), the theorems still provide guidance: the run-times can be improved by
making the reward structure dense, for instance, by subtracting some constant $c \in \mathbb{R}$
from all immediate rewards. This does not change the relative preferences among plans,
since, for a discount factor $\gamma < 1$, every infinite sequence of immediate rewards
$r_0, r_1, \ldots$ loses the same amount:

\[ \sum_{t=0}^{\infty} \gamma^t (r_t - c) = \sum_{t=0}^{\infty} \gamma^t r_t - \frac{c}{1 - \gamma}. \]

Alternatively, one can use sufficiently large initial Q-values for Q-learning (or U-values
for value-iteration). In both cases it can be necessary to use discounting.

In summary, reinforcement learning algorithms are useful for enabling agents to explore
unknown state spaces and learn information relevant to performing tasks. The results
in this report add to that research by showing that reinforcement learning is tractable,
and therefore can scale up to handle real-world problems.

Appendix


The  appendix  contains  the  proofs  of  the  lemmas  and  theorems  stated  in  the  main  text 
that  have  not  been  proved  there.  We  use  the  terminology  and  symbols  introduced  in 
the  main  text. 


A  Upper  Bounds 

A.1  Reaching a Goal State with Q-Learning

In the following, we consider an undiscounted, admissible Q-learning algorithm with
action-penalty representation. The proofs can then be transferred to the other algorithms
(for example a discounted, admissible Q-learning algorithm with action-penalty
representation or a high-initialized, discounted Q-learning algorithm with goal-reward
representation) as outlined in the main text.

The time superscripts used in this chapter refer to the values of the variables immediately
before the action execution step of the Q-learning algorithm, i.e. line 4, of
step t.


Theorem 12 Zero-initialized Q-values are consistent.


Proof: $Q(s,a) = 0$ for all $s \in G$ and $a \in A(s)$. Since $U(s) = \max_{a \in A(s)} Q(s,a) =
\max_{a \in A(s)} 0 = 0$ for all $s \in S$, it holds for all $s \in S \setminus G$ and $a \in A(s)$ that
$-1 + U(succ(s,a)) = -1 + 0 \le 0 = Q(s,a) \le 0$.


Theorem 13 (Minus one)-initialized Q-values are consistent.


Proof: $Q(s,a) = 0$ for all $s \in G$ and $a \in A(s)$. Since $U(s) = \max_{a \in A(s)} Q(s,a) \le
\max_{a \in A(s)} 0 = 0$ for all $s \in S$, it holds for all $s \in S \setminus G$ and $a \in A(s)$ that
$-1 + U(succ(s,a)) \le -1 + 0 = Q(s,a) \le 0$.
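
The consistency condition used in the two proofs above ($Q(s,a) = 0$ in goal states and $-1 + U(succ(s,a)) \le Q(s,a) \le 0$ elsewhere) is easy to check mechanically. The sketch below does this for a small one-dimensional gridworld. The helper names `u_value` and `is_consistent`, the dictionary encoding of the domain, and the self-loop action `stay` in the goal state are assumptions made for this illustration.

```python
def u_value(q, succ, s):
    # U(s) is the largest Q-value over the actions available in s.
    return max(q[(s, a)] for a in succ[s])

def is_consistent(q, succ, goals):
    # Q(s,a) = 0 in goal states; -1 + U(succ(s,a)) <= Q(s,a) <= 0 elsewhere.
    for s, actions in succ.items():
        for a, t in actions.items():
            if s in goals:
                if q[(s, a)] != 0:
                    return False
            elif not (-1 + u_value(q, succ, t) <= q[(s, a)] <= 0):
                return False
    return True

def initialize(succ, goals, value):
    # Goal states always get Q-value 0; all other pairs get the given value.
    return {(s, a): (0 if s in goals else value) for s in succ for a in succ[s]}

# Hypothetical four-state gridworld with goal state 4.
succ = {1: {"right": 2}, 2: {"left": 1, "right": 3}, 3: {"left": 2, "right": 4}, 4: {"stay": 4}}
goals = {4}

zero_ok = is_consistent(initialize(succ, goals, 0), succ, goals)
minus_one_ok = is_consistent(initialize(succ, goals, -1), succ, goals)
minus_three_ok = is_consistent(initialize(succ, goals, -3), succ, goals)
print(zero_ok, minus_one_ok, minus_three_ok)
```

In this domain the (minus three)-initialized values fail the check at the state next to the goal, which shows that the condition is not satisfied by arbitrary negative initializations.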


Theorem 14 If Q-values are consistent, then $-\min_{s' \in PG} d(s,s') \le U(s) \le 0$ for all
$s \in S$, where $PG := \{s \in S : U(s) = 0\}$.

Proof by induction on $d'(s) := \min_{s' \in PG} d(s,s')$:


•  If $d'(s) = 0$, then $s \in PG$ and therefore $U(s) = 0$. Thus, $-d'(s) = U(s) = 0$.

•  If $d'(s) \ne 0$, then $s \notin PG \supseteq G$ and therefore $s \in S \setminus G$. Since $G \ne \emptyset$, it holds
that $PG \ne \emptyset$. Let $a := \arg\min_{a' \in A(s)} d'(succ(s,a'))$. Assume that the theorem
holds for all $s'' \in S$ with $d'(s'') < d'(s)$. Then, it holds for $succ(s,a)$ as well, since
$d'(succ(s,a)) = d'(s) - 1 < d'(s)$. Thus,

\[ -d'(s) = -1 - d'(succ(s,a)) \overset{\text{assumption}}{\le} -1 + U(succ(s,a)) \le Q(s,a) \le \max_{a' \in A(s)} Q(s,a') = U(s) \le \max_{s' \in S, a' \in A(s')} Q(s',a') \le 0. \]


Theorem 15 If Q-values are consistent, then $-1 - \min_{s' \in PG} d(succ(s,a), s') \le
Q(s,a) \le 0$ for all $s \in S \setminus G$ and $a \in A(s)$, where $PG := \{s \in S : U(s) = 0\}$.


Proof: $-1 - \min_{s' \in PG} d(succ(s,a), s') \overset{\text{Theorem 14}}{\le} -1 + U(succ(s,a)) \le Q(s,a) \le 0$ for all
$s \in S \setminus G$ and $a \in A(s)$.


Theorem 16 If Q-values are consistent, then $-gd(s) \le U(s) \le 0$ for all $s \in S$.


Proof: $-gd(s) = -\min_{s' \in G} d(s,s') \le -\min_{s' \in PG} d(s,s') \overset{\text{Theorem 14}}{\le} U(s) \overset{\text{Theorem 14}}{\le} 0$ for all
$s \in S$.


Theorem 17 If Q-values are consistent, then $-d \le \min_{s \in S} U(s) - \max_{s \in S} U(s)$.


Proof of $-d \le -d'(s) \le U(s) - \max_{s' \in S} U(s')$ for all $s \in S$ by induction on $d'(s) :=
\min_{s' \in G'} d(s,s')$, where $G' := \{s \in S : U(s) = \max_{s' \in S} U(s')\} \ne \emptyset$. Then, $-d \le
\min_{s \in S} U(s) - \max_{s \in S} U(s)$ and the theorem is proved.


•  If $d'(s) = 0$, then $s \in G'$ and $-d'(s) = 0 = \max_{s' \in S} U(s') - \max_{s' \in S} U(s') =
U(s) - \max_{s' \in S} U(s')$.

•  If $d'(s) \ne 0$, then $U(s) < \max_{s' \in S} U(s') \overset{\text{Theorem 14}}{\le} \max_{s' \in S} 0 = 0$ and therefore
$s \notin G$. Assume that the theorem holds for all $s' \in S$ with $d'(s') < d'(s)$. There
exists an $a \in A(s)$ such that $d'(succ(s,a)) = d'(s) - 1 < d'(s)$. Then,

\[ -d'(s) = -1 - d'(succ(s,a)) \overset{\text{assumption}}{\le} -1 + U(succ(s,a)) - \max_{s' \in S} U(s') \le Q(s,a) - \max_{s' \in S} U(s') \le \max_{a' \in A(s)} Q(s,a') - \max_{s' \in S} U(s') = U(s) - \max_{s' \in S} U(s'). \]

Theorem 18 Consistent Q-values are admissible.


Proof: Assume consistent Q-values. $Q(s,a) = 0$ for all $s \in G$ and $a \in A(s)$. It holds
for all $s \in S \setminus G$ and $a \in A(s)$ that
$-1 - gd(succ(s,a)) \overset{\text{Theorem 16}}{\le} -1 + U(succ(s,a)) \le Q(s,a) \le 0$.


Theorem 19 $U_{opt}(s) = -gd(s)$ for all $s \in S$.


Proof: According to Equations 2 and the fact that we are using action-penalty
representation

\[ U_{opt}(s) = \begin{cases} 0 & \text{if } s \in G \\ \max_{a \in A(s)} (-1 + U_{opt}(succ(s,a))) & \text{otherwise} \end{cases} \quad \text{for all } s \in S \]

\[ -U_{opt}(s) = \begin{cases} 0 & \text{if } s \in G \\ 1 + \min_{a \in A(s)} (-U_{opt}(succ(s,a))) & \text{otherwise} \end{cases} \quad \text{for all } s \in S \]

(We utilize that $U_{opt}(s) = 0$ for $s \in G$: Once in a goal state, the agent can execute
only one action, which has an immediate reward of 0 and does not leave the goal state.
Thus, the total reward of the continued execution of this action is 0. This fact does
not follow from Equations 2 if $\gamma = 1$.)

Comparing this to the definition of gd(s)

\[ gd(s) = \min_{s' \in G} d(s,s') = \min_{s' \in G} \begin{cases} 0 & \text{if } s = s' \\ 1 + \min_{a \in A(s)} d(succ(s,a), s') & \text{otherwise} \end{cases} = \begin{cases} 0 & \text{if } s \in G \\ 1 + \min_{a \in A(s)} gd(succ(s,a)) & \text{otherwise} \end{cases} \quad \text{for all } s \in S \]

shows that $-U_{opt}(s) = gd(s)$ for all $s \in S$, since the solution of the set of equations is
unique.
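
The identity $U_{opt}(s) = -gd(s)$ can be observed on a small example by computing both sides independently. The sketch below is an illustration only: goal distances come from a backward breadth-first search, and $U_{opt}$ is obtained by iterating the fixed-point equation from the proof, starting at zero, until nothing changes. The function names and the five-state reset state space are assumptions introduced here.

```python
from collections import deque

def goal_distances(succ, goals):
    # Breadth-first search backwards from the goal states.
    pred = {s: [] for s in succ}
    for s, targets in succ.items():
        for t in targets:
            pred[t].append(s)
    gd = {g: 0 for g in goals}
    queue = deque(goals)
    while queue:
        t = queue.popleft()
        for s in pred[t]:
            if s not in gd:
                gd[s] = gd[t] + 1
                queue.append(s)
    return gd

def optimal_u(succ, goals):
    # Iterate U(s) = 0 for goals and max over a of (-1 + U(succ(s,a))) otherwise.
    # Terminates when every state can reach a goal state.
    u = {s: 0 for s in succ}
    while True:
        new_u = {s: 0 if s in goals else max(-1 + u[t] for t in succ[s]) for s in succ}
        if new_u == u:
            return u
        u = new_u

# Hypothetical reset state space with five states; the goal 5 has a single self-loop.
succ = {1: [2], 2: [3, 1], 3: [4, 1], 4: [5, 1], 5: [5]}
goals = {5}

gd = goal_distances(succ, goals)
u_opt = optimal_u(succ, goals)
print(gd, u_opt)
```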


Theorem 20 If Q-values are admissible, then $-gd(s) \le U(s) \le 0$ for all $s \in S$.

Proof: If $s \in G$, then $-gd(s) = 0 = \max_{a \in A(s)} 0 = \max_{a \in A(s)} Q(s,a) = U(s)$. It holds
for all $s \in S \setminus G$ that $-gd(s) = -(1 + \min_{a \in A(s)} gd(succ(s,a))) = \max_{a \in A(s)} (-1 -
gd(succ(s,a))) \le \max_{a \in A(s)} Q(s,a) = U(s) \le \max_{s' \in S, a' \in A(s')} Q(s',a') \le 0$.

Consider the following algorithm: Given arbitrary Q-values. Pick an arbitrary state
$s \in S \setminus G$ and determine $a := \arg\max_{a' \in A(s)} Q(s,a')$. (Ties can be broken arbitrarily.)
Set $Q(s,a) := -1 + U(succ(s,a))$ and leave the other Q-values unchanged. Refer to
the old Q-values as $Q^0(s,a)$ and to the new ones as $Q^1(s,a)$, i.e.

\[ Q^1(s',a') = \begin{cases} -1 + U^0(succ(s',a')) & \text{if } s' = s \text{ and } a' = a \\ Q^0(s',a') & \text{otherwise} \end{cases} \quad \text{for all } s' \in S \text{ and } a' \in A(s'). \]


Theorem 21 If the $Q^0$-values are consistent, then

1.  $Q^1(s',a') \le Q^0(s',a')$ for all $s' \in S$ and $a' \in A(s')$,

2.  $U^1(s') \le U^0(s')$ for all $s' \in S$, and

3.  the $Q^1$-values are consistent.

Proof:

1.  $Q^1(s',a') = -1 + U^0(succ(s',a')) \le Q^0(s',a')$ for $s' = s$ and $a' = a$, and
$Q^1(s',a') = Q^0(s',a')$ otherwise. Thus, $Q^1(s',a') \le Q^0(s',a')$ for all $s' \in S$
and $a' \in A(s')$.

2.  According to the first part of this theorem, it holds that $Q^1(s',a') \le Q^0(s',a') \le 0$
for all $s' \in S$ and $a' \in A(s')$. Then, $U^1(s') = \max_{a' \in A(s')} Q^1(s',a') \le
\max_{a' \in A(s')} Q^0(s',a') = U^0(s')$ for all $s' \in S$.

3.  According to the second part of this theorem, it holds that $U^1(s') \le U^0(s')$ for
all $s' \in S$. Then, $-1 + U^1(succ(s',a')) \le -1 + U^0(succ(s',a')) = Q^1(s',a') \le 0$
for $s' = s$ and $a' = a$, $Q^1(s',a') = Q^0(s',a') = 0$ for all $s' \in G$ and $a' \in A(s')$,
and $-1 + U^1(succ(s',a')) \le -1 + U^0(succ(s',a')) \le Q^0(s',a') = Q^1(s',a') \le 0$
otherwise. Thus, the $Q^1$-values are consistent.


Theorem 22 If the initial Q-values are consistent, then they remain consistent after
every step of the agent, and the Q-values and U-values are monotonically decreasing.

Proof by induction on the number of steps: The Q-values are consistent before the
first step. Assume that they are consistent before an arbitrary step. According to
Theorem 21, they are consistent after the step, and the Q-values and U-values are
monotonically decreasing.
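
Theorems 21 and 22 can be exercised on a small example. The sketch below assumes the consistency condition that the proof of Theorem 21 works with ($Q(s,a) = 0$ in goal states and $-1 + U(succ(s,a)) \le Q(s,a) \le 0$ elsewhere), a hypothetical four-state chain, and ties broken by dictionary order; it applies the update $Q(s,a) := -1 + U(succ(s,a))$ to randomly chosen non-goal states and checks both claims after every update.

```python
import random

# Hypothetical chain 0 - 1 - 2 - 3 with goal state 3.
SUCC = {0: {"r": 1}, 1: {"l": 0, "r": 2}, 2: {"l": 1, "r": 3}, 3: {"stay": 3}}
GOALS = {3}

def U(Q, s):
    """U(s) = max_a Q(s, a)."""
    return max(Q[s].values())

def is_consistent(Q):
    """Q(s,a) = 0 on goal states; -1 + U(succ(s,a)) <= Q(s,a) <= 0 elsewhere."""
    for s, acts in SUCC.items():
        for a, s2 in acts.items():
            if s in GOALS:
                if Q[s][a] != 0:
                    return False
            elif not (-1 + U(Q, s2) <= Q[s][a] <= 0):
                return False
    return True

def update(Q, s):
    """Return a copy of Q after the update Q(s,a) := -1 + U(succ(s,a)) for a greedy a."""
    a = max(Q[s], key=Q[s].get)          # ties broken by dictionary order
    Q1 = {x: dict(q) for x, q in Q.items()}
    Q1[s][a] = -1 + U(Q, SUCC[s][a])
    return Q1

def run_checks(steps=200, seed=0):
    """Apply random updates; report whether consistency and monotonicity always held."""
    rng = random.Random(seed)
    Q = {s: {a: 0 for a in acts} for s, acts in SUCC.items()}  # zero-initialized
    ok = is_consistent(Q)
    non_goals = [s for s in SUCC if s not in GOALS]
    for _ in range(steps):
        Q1 = update(Q, rng.choice(non_goals))
        ok = ok and is_consistent(Q1)
        ok = ok and all(Q1[s][a] <= Q[s][a] for s in Q for a in Q[s])
        Q = Q1
    return ok
```

Note that the states are picked arbitrarily here, as in the algorithm stated before Theorem 21, rather than along a trajectory of the agent.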


Consider the following algorithm: Given arbitrary Q-values, pick an arbitrary state
$s \in S \setminus G$ and determine $a := \arg\max_{a' \in A(s)} Q(s,a')$. (Ties can be broken arbitrarily.)
Set $Q(s,a) := \min(Q(s,a), -1 + U(succ(s,a)))$ and leave the other Q-values unchanged.
Refer to the old Q-values as $Q^0(s,a)$ and to the new ones as $Q^1(s,a)$, i.e.

$$Q^1(s',a') = \begin{cases} \min(Q^0(s',a'), -1 + U^0(succ(s',a'))) & \text{if } s' = s \text{ and } a' = a \\ Q^0(s',a') & \text{otherwise} \end{cases}
\quad \text{for all } s' \in S \text{ and } a' \in A(s')$$


Theorem 23 If the $Q^0$-values are admissible, then

1. $Q^1(s',a') \le Q^0(s',a')$ for all $s' \in S$ and $a' \in A(s')$,

2. $U^1(s') \le U^0(s')$ for all $s' \in S$, and

3. the $Q^1$-values are admissible.

Proof:

1. $Q^1(s',a') = \min(Q^0(s',a'), -1 + U^0(succ(s',a'))) \le Q^0(s',a')$ for $s' = s$ and
$a' = a$, and $Q^1(s',a') = Q^0(s',a')$ otherwise. Thus, $Q^1(s',a') \le Q^0(s',a')$ for all
$s' \in S$ and $a' \in A(s')$.

2. According to the first part of this theorem, it holds that $Q^1(s',a') \le Q^0(s',a') \le 0$
for all $s' \in S$ and $a' \in A(s')$. Then, $U^1(s') = \max_{a' \in A(s')} Q^1(s',a') \le
\max_{a' \in A(s')} Q^0(s',a') = U^0(s')$ for all $s' \in S$.

3. $-1 - gd(succ(s',a')) \le Q^0(s',a')$ and $-1 - gd(succ(s',a')) \overset{\text{Theorem 20}}{\le} -1 +
U^0(succ(s',a'))$ for $s' = s$ and $a' = a$, therefore $-1 - gd(succ(s',a')) \le
\min(Q^0(s',a'), -1 + U^0(succ(s',a'))) = Q^1(s',a') \le 0$ for $s' = s$ and $a' = a$.
$Q^1(s',a') = Q^0(s',a') = 0$ for all $s' \in G$ and $a' \in A(s')$, and $-1 - gd(succ(s',a')) \le
Q^0(s',a') = Q^1(s',a') \le 0$ otherwise. Thus, the $Q^1$-values are admissible.


Theorem 24 If the initial Q-values are admissible, then they remain admissible after
every step of the agent, and the Q-values and U-values are monotonically decreasing.

Proof by induction on the number of steps: The Q-values are admissible before the
first step. Assume that they are admissible before an arbitrary step. According to
Theorem 23, they are admissible after the step, and the Q-values and U-values are
monotonically decreasing.
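
The admissible case can be illustrated in the same way. The following sketch assumes the admissibility condition used in the proof of Theorem 23 ($Q(s,a) = 0$ in goal states and $-1 - gd(succ(s,a)) \le Q(s,a) \le 0$ elsewhere); the chain, its hard-coded goal distances `GD`, and the random admissible initialization are hypothetical choices made for illustration.

```python
import random

# Hypothetical chain 0 - 1 - 2 - 3 with goal state 3 and its goal distances.
SUCC = {0: {"r": 1}, 1: {"l": 0, "r": 2}, 2: {"l": 1, "r": 3}, 3: {"stay": 3}}
GOALS = {3}
GD = {0: 3, 1: 2, 2: 1, 3: 0}

def U(Q, s):
    """U(s) = max_a Q(s, a)."""
    return max(Q[s].values())

def is_admissible(Q):
    """Q(s,a) = 0 on goal states; -1 - gd(succ(s,a)) <= Q(s,a) <= 0 elsewhere."""
    for s, acts in SUCC.items():
        for a, s2 in acts.items():
            if s in GOALS:
                if Q[s][a] != 0:
                    return False
            elif not (-1 - GD[s2] <= Q[s][a] <= 0):
                return False
    return True

def min_update(Q, s):
    """Return a copy of Q after Q(s,a) := min(Q(s,a), -1 + U(succ(s,a))) for a greedy a."""
    a = max(Q[s], key=Q[s].get)          # ties broken by dictionary order
    Q1 = {x: dict(q) for x, q in Q.items()}
    Q1[s][a] = min(Q[s][a], -1 + U(Q, SUCC[s][a]))
    return Q1

def run_checks(steps=200, seed=1):
    """Start from random admissible Q-values and check Theorem 23 after every update."""
    rng = random.Random(seed)
    Q = {
        s: {a: 0 if s in GOALS else rng.randint(-1 - GD[s2], 0) for a, s2 in acts.items()}
        for s, acts in SUCC.items()
    }
    ok = is_admissible(Q)
    non_goals = [s for s in SUCC if s not in GOALS]
    for _ in range(steps):
        Q1 = min_update(Q, rng.choice(non_goals))
        ok = ok and is_admissible(Q1)
        ok = ok and all(Q1[s][a] <= Q[s][a] for s in Q for a in Q[s])
        Q = Q1
    return ok
```

The `min` in the update is what keeps monotonicity for initial values that are admissible but not consistent.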


Theorem 25 For all steps $t \in \mathbb{N}_0$ (until termination) of an undiscounted,
admissible Q-learning algorithm it holds that

$$U^t(s^t) + \sum_{s \in S} \sum_{a \in A(s)} Q^0(s,a) - t \ge \sum_{s \in S} \sum_{a \in A(s)} Q^t(s,a) + U^0(s^0) - loop^t$$

and

$$loop^t \le \sum_{s \in S} \sum_{a \in A(s)} Q^0(s,a) - \sum_{s \in S} \sum_{a \in A(s)} Q^t(s,a),$$

where $loop^t := |\{t' \in \{0, \ldots, t-1\} : s^{t'} = s^{t'+1}\}|$ (the number of identity
actions executed before $t$).

Proof by induction on $t$: The theorem trivially holds for $t = 0$. Assume that it holds
for an arbitrary $t$. Note that $Q^t(s^t,a^t) = U^t(s^t)$, due to the specific action-selection
step used. We distinguish two cases:


• The action executed at $t$ is an identity action:

Then, $s^{t+1} = s^t$ and $loop^{t+1} = 1 + loop^t$. Depending on the value update
step used, it holds that either $Q^{t+1}(s^t,a^t) = -1 + U^t(s^t) = -1 + Q^t(s^t,a^t)$
or $Q^{t+1}(s^t,a^t) = \min(Q^t(s^t,a^t), -1 + U^t(s^t)) = \min(Q^t(s^t,a^t), -1 + Q^t(s^t,a^t)) =
-1 + Q^t(s^t,a^t)$. Thus, in both cases $Q^{t+1}(s^t,a^t) = -1 + Q^t(s^t,a^t)$. $U^{t+1}(s^t) =
\max_{a \in A(s^t)} Q^{t+1}(s^t,a) \ge Q^{t+1}(s^t,a^t) = -1 + Q^t(s^t,a^t) = -1 + U^t(s^t)$. All other
values do not change from $t$ to $t + 1$.

$$U^{t+1}(s^{t+1}) + \sum_{s \in S} \sum_{a \in A(s)} Q^0(s,a) - (t+1)
\ge (-1 + U^t(s^t)) + \sum_{s \in S} \sum_{a \in A(s)} Q^0(s,a) - (t+1)
= \Big(U^t(s^t) + \sum_{s \in S} \sum_{a \in A(s)} Q^0(s,a) - t\Big) - 2$$
$$\overset{\text{assumption}}{\ge} \Big(\sum_{s \in S} \sum_{a \in A(s)} Q^t(s,a) + U^0(s^0) - loop^t\Big) - 2
= \Big(-1 + \sum_{s \in S} \sum_{a \in A(s)} Q^t(s,a)\Big) + U^0(s^0) - (1 + loop^t)
= \sum_{s \in S} \sum_{a \in A(s)} Q^{t+1}(s,a) + U^0(s^0) - loop^{t+1}.$$

$$loop^{t+1} = 1 + loop^t \overset{\text{assumption}}{\le} 1 + \Big(\sum_{s \in S} \sum_{a \in A(s)} Q^0(s,a) - \sum_{s \in S} \sum_{a \in A(s)} Q^t(s,a)\Big)
= \sum_{s \in S} \sum_{a \in A(s)} Q^0(s,a) - \Big(-1 + \sum_{s \in S} \sum_{a \in A(s)} Q^t(s,a)\Big)
= \sum_{s \in S} \sum_{a \in A(s)} Q^0(s,a) - \sum_{s \in S} \sum_{a \in A(s)} Q^{t+1}(s,a).$$

In other words, the theorem also holds for $t + 1$.

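• The action executed at $t$ is not an identity action:

Then, $s^{t+1} \ne s^t$, $loop^{t+1} = loop^t$, and (independent of the value update step
used) $Q^{t+1}(s^t,a^t) \le -1 + U^t(s^{t+1}) = -1 + U^{t+1}(s^{t+1})$. All other values, except
for $U(s^t)$, do not change from $t$ to $t + 1$.

$$U^{t+1}(s^{t+1}) + \sum_{s \in S} \sum_{a \in A(s)} Q^0(s,a) - (t+1)
= \Big(U^t(s^t) + \sum_{s \in S} \sum_{a \in A(s)} Q^0(s,a) - t\Big) + U^{t+1}(s^{t+1}) - U^t(s^t) - 1$$
$$\overset{\text{assumption}}{\ge} \Big(\sum_{s \in S} \sum_{a \in A(s)} Q^t(s,a) + U^0(s^0) - loop^t\Big) + U^{t+1}(s^{t+1}) - U^t(s^t) - 1
= (-1 + U^{t+1}(s^{t+1})) - U^t(s^t) + \sum_{s \in S} \sum_{a \in A(s)} Q^t(s,a) + U^0(s^0) - loop^t$$
$$\ge \Big(Q^{t+1}(s^t,a^t) - Q^t(s^t,a^t) + \sum_{s \in S} \sum_{a \in A(s)} Q^t(s,a)\Big) + U^0(s^0) - loop^t
= \sum_{s \in S} \sum_{a \in A(s)} Q^{t+1}(s,a) + U^0(s^0) - loop^{t+1}.$$

$$loop^{t+1} = loop^t \overset{\text{assumption}}{\le} \sum_{s \in S} \sum_{a \in A(s)} Q^0(s,a) - \sum_{s \in S} \sum_{a \in A(s)} Q^t(s,a)
\le \sum_{s \in S} \sum_{a \in A(s)} Q^0(s,a) - \sum_{s \in S} \sum_{a \in A(s)} Q^{t+1}(s,a),$$

since $Q^{t+1}(s,a) \le Q^t(s,a)$ for all $s \in S$ and $a \in A(s)$ according to Theorems 22 and 24.

In other words, the theorem also holds for $t + 1$.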


The following theorem is a simplified version of the previous one.

Theorem 26 For all steps $t \in \mathbb{N}_0$ (until termination) of an undiscounted,
zero-initialized, and admissible Q-learning algorithm it holds that $U^t(s^t) - t =
\sum_{s \in S} \sum_{a \in A(s)} Q^t(s,a)$ if the state space has no identity actions.

Proof by induction on $t$: The theorem trivially holds for $t = 0$. Assume that it
holds for an arbitrary $t$. $s^{t+1} \ne s^t$ and (independent of the value update step used)
$Q^{t+1}(s^t,a^t) = -1 + U^t(s^{t+1}) = -1 + U^{t+1}(s^{t+1})$, since the $Q^t$-values are consistent
according to Theorems 12 and 22 and therefore $Q^t(s^t,a^t) \ge -1 + U^t(succ(s^t,a^t))$. All
other values, except for $U(s^t)$, do not change from $t$ to $t + 1$. Note that $Q^t(s^t,a^t) =
U^t(s^t)$, due to the specific action-selection step used.

$$U^{t+1}(s^{t+1}) - (t+1) = (U^t(s^t) - t) + U^{t+1}(s^{t+1}) - U^t(s^t) - 1
\overset{\text{assumption}}{=} \sum_{s \in S} \sum_{a \in A(s)} Q^t(s,a) + U^{t+1}(s^{t+1}) - U^t(s^t) - 1$$
$$= (-1 + U^{t+1}(s^{t+1})) - U^t(s^t) + \sum_{s \in S} \sum_{a \in A(s)} Q^t(s,a)
= Q^{t+1}(s^t,a^t) - Q^t(s^t,a^t) + \sum_{s \in S} \sum_{a \in A(s)} Q^t(s,a)
= \sum_{s \in S} \sum_{a \in A(s)} Q^{t+1}(s,a).$$

In other words, the theorem also holds for $t + 1$.
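
The invariant of Theorem 26 can be observed along an actual run. The sketch below is an illustration under stated assumptions: a hypothetical four-state chain whose non-goal states have no identity actions, zero-initialized Q-values, greedy action selection with ties broken by dictionary order, and the update $Q(s,a) := -1 + U(succ(s,a))$.

```python
# Hypothetical chain 0 - 1 - 2 - 3 with goal state 3; the agent stops at the goal,
# so the self-loop in state 3 is never executed.
SUCC = {0: {"r": 1}, 1: {"l": 0, "r": 2}, 2: {"l": 1, "r": 3}, 3: {"stay": 3}}
GOALS = {3}

def run_and_check(start=0):
    """Run zero-initialized Q-learning and record whether U^t(s^t) - t equals the
    sum of all Q-values immediately before every step t."""
    Q = {s: {a: 0 for a in acts} for s, acts in SUCC.items()}
    s, t = start, 0
    holds = []
    while True:
        u_current = max(Q[s].values())
        total = sum(q for acts in Q.values() for q in acts.values())
        holds.append(u_current - t == total)
        if s in GOALS:
            return t, holds
        a = max(Q[s], key=Q[s].get)               # greedy action selection
        s_next = SUCC[s][a]
        Q[s][a] = -1 + max(Q[s_next].values())    # value update step
        s, t = s_next, t + 1

STEPS, HOLDS = run_and_check()
```

Every executed action lowers the sum of the Q-values by exactly the amount by which $U^t(s^t) - t$ drops, which is why the equality can be tracked step by step.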


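Theorem 27 For all steps $t \in \mathbb{N}_0$ (until termination) of an undiscounted, admissible
Q-learning algorithm it holds that $t \le U^t(s^t) - U^0(s^0) + 2 \sum_{s \in S} \sum_{a \in A(s)} (Q^0(s,a) -
Q^t(s,a))$.

Proof:

$$t \overset{\text{Theorem 25}}{\le} U^t(s^t) + \sum_{s \in S} \sum_{a \in A(s)} Q^0(s,a) - \sum_{s \in S} \sum_{a \in A(s)} Q^t(s,a) - U^0(s^0) + loop^t$$
$$\overset{\text{Theorem 25}}{\le} U^t(s^t) + \sum_{s \in S} \sum_{a \in A(s)} Q^0(s,a) - \sum_{s \in S} \sum_{a \in A(s)} Q^t(s,a) - U^0(s^0)
+ \Big(\sum_{s \in S} \sum_{a \in A(s)} Q^0(s,a) - \sum_{s \in S} \sum_{a \in A(s)} Q^t(s,a)\Big)$$
$$= U^t(s^t) + 2 \sum_{s \in S} \sum_{a \in A(s)} Q^0(s,a) - 2 \sum_{s \in S} \sum_{a \in A(s)} Q^t(s,a) - U^0(s^0)
= U^t(s^t) - U^0(s^0) + 2 \sum_{s \in S} \sum_{a \in A(s)} (Q^0(s,a) - Q^t(s,a)).$$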


Theorem 28 An undiscounted, admissible Q-learning algorithm reaches a goal state
and terminates after at most $2 \sum_{s \in S \setminus G} \sum_{a \in A(s)} (Q^0(s,a) + gd(succ(s,a)) + 1) - U^0(s^0)$
steps.

Proof:

$$t \overset{\text{Theorem 27}}{\le} U^t(s^t) - U^0(s^0) + 2 \sum_{s \in S} \sum_{a \in A(s)} (Q^0(s,a) - Q^t(s,a))
\le -U^0(s^0) + 2 \sum_{s \in S \setminus G} \sum_{a \in A(s)} (Q^0(s,a) + gd(succ(s,a)) + 1),$$

since the Q-values are admissible according to Theorems 22 and 18 or Theorem 24, and
therefore $Q^0(s,a) = Q^t(s,a) = 0$ for all $s \in G$ and $a \in A(s)$, $-1 - gd(succ(s,a)) \le Q^t(s,a)$
for all $s \in S \setminus G$ and $a \in A(s)$, and $U^t(s) \overset{\text{Theorem 20}}{\le} 0$ for all $s \in S$.


Theorem 29 An admissible Q-learning algorithm reaches a goal state and terminates
after at most $O(ed)$ steps.

Proof: The algorithm reaches a goal state and terminates after at most

$$O\Big(2 \sum_{s \in S \setminus G} \sum_{a \in A(s)} (Q^0(s,a) + gd(succ(s,a)) + 1) - U^0(s^0)\Big)
\le O\Big(2 \sum_{s \in S \setminus G} \sum_{a \in A(s)} (d + 1) + d\Big) \le O(2e(d+1) + d) = O(ed)$$

steps according to Theorem 28, since the Q-values are admissible according to Theorems 22
and 18 or Theorem 24, and therefore $Q^0(s,a) \le 0$ for all $s \in S$ and $a \in A(s)$, and
$-d \le -gd(s) \overset{\text{Theorem 20}}{\le} U^0(s)$ for all $s \in S$.
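
To make the bound of Theorem 28 concrete, the following sketch evaluates it for a small example and compares it with the number of steps an actual run needs. The chain, its hard-coded goal distances `GD`, and the helper names are hypothetical; the run uses zero-initialized Q-values (which are admissible), greedy action selection, and the update $Q(s,a) := -1 + U(succ(s,a))$.

```python
# Hypothetical chain 0 - 1 - 2 - 3 with goal state 3 and its goal distances.
SUCC = {0: {"r": 1}, 1: {"l": 0, "r": 2}, 2: {"l": 1, "r": 3}, 3: {"stay": 3}}
GOALS = {3}
GD = {0: 3, 1: 2, 2: 1, 3: 0}

def zero_q():
    """Zero-initialized Q-values."""
    return {s: {a: 0 for a in acts} for s, acts in SUCC.items()}

def theorem28_bound(Q0, start):
    """2 * sum over non-goal (s, a) of (Q0(s,a) + gd(succ(s,a)) + 1) - U0(start)."""
    total = sum(
        Q0[s][a] + GD[s2] + 1
        for s, acts in SUCC.items() if s not in GOALS
        for a, s2 in acts.items()
    )
    return 2 * total - max(Q0[start].values())

def steps_to_goal(Q, start):
    """Run the Q-learning algorithm until a goal state is reached; return the step count."""
    s, t = start, 0
    while s not in GOALS:
        a = max(Q[s], key=Q[s].get)               # greedy action, ties by dict order
        s_next = SUCC[s][a]
        Q[s][a] = -1 + max(Q[s_next].values())    # value update step
        s, t = s_next, t + 1
    return t

BOUND = theorem28_bound(zero_q(), 0)
STEPS = {start: steps_to_goal(zero_q(), start) for start in (0, 1, 2)}
```

For zero-initialized Q-values the bound does not depend on the start state, since $U^0(s^0) = 0$ everywhere.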


A.2 Finding Optimal Policies with Bi-Directional Q-Learning (Version 1)

In the following, we consider the bi-directional Q-learning algorithm (version 1) as
stated in Figure 20. The proofs can then be transferred to a discounted, admissible
bi-directional Q-learning algorithm with action-penalty representation or a
high-initialized, discounted bi-directional Q-learning algorithm with goal-reward
representation as outlined in Chapter 4 for the (basic) Q-learning algorithm.

The time superscripts used in this chapter refer to the values of the variables
immediately before the action execution steps of the bi-directional Q-learning algorithm, i.e.
lines 5 and 9, of step $t$.


We define a “forward phase” of the bi-directional Q-learning algorithm (version 1) to
be a largest interval of steps $[t^a, t^b] \subseteq \mathbb{N}_0$ (with $t^a < t^b$) such that line 5 was executed
at all steps $t \in [t^a, t^b - 1]$. Analogously, we define a “backward phase” to be a largest
interval of steps $[t^a, t^b] \subseteq \mathbb{N}_0$ (with $t^a < t^b$) such that line 9 was executed at all steps
$t \in [t^a, t^b - 1]$.

These definitions imply that the agent is in a state $s^{t^a}$ with $done^{t^a}(s^{t^a}) = false$ at the
beginning of a forward phase $[t^a, t^b]$. Then, it exclusively executes forward steps, until
it finally reaches a state $s^{t^b}$ with $done^{t^b}(s^{t^b}) = true$. The opposite holds for a backward
phase.

The bi-directional Q-learning algorithm starts with a forward phase if $s_{start} \notin G$, then
alternates between backward and forward phases, and ends with a backward phase.
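
The phase definitions amount to splitting the sequence of executed steps into maximal runs of equal type. A minimal sketch, assuming a hypothetical trace that records for every step whether a forward step (`"f"`, line 5) or a backward step (`"b"`, line 9) was executed:

```python
from itertools import groupby

def phases(kinds):
    """Split a step trace into maximal forward and backward phases.

    kinds[t] is "f" if a forward step was executed at step t and "b" if a
    backward step was executed. A maximal run of equal kinds over the steps
    ta, ..., tb - 1 yields the phase (kind, ta, tb), i.e. the interval [ta, tb].
    """
    result, t = [], 0
    for kind, run in groupby(kinds):
        length = len(list(run))
        result.append((kind, t, t + length))
        t += length
    return result

# Hypothetical trace: three forward steps, two backward steps, and so on.
TRACE = ["f", "f", "f", "b", "b", "f", "b"]
PHASES = phases(TRACE)
```

Consecutive phases share their boundary step: the end $t^b$ of one phase is the start $t^a$ of the next, and by construction forward and backward phases alternate.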


Theorem 30 The $Q_f$-values are consistent after every step of the agent and are
monotonically decreasing.

Proof by induction on $t$:

• Initially, $Q_f(s,a) = 0$ for all $s \in S$ and $a \in A(s)$. Thus, the $Q_f$-values are
consistent according to Theorem 12.

• Assume that the $Q_f$-values are consistent before an arbitrary step. The only line
that can change a $Q_f$-value is line 6. If line 6 is executed, then the $Q_f$-values
remain consistent after the step and are monotonically decreasing according to
Theorem 21. If line 6 is not executed, then the $Q_f$-values remain unchanged and
are consistent according to the assumption.


Theorem 31 Every forward phase terminates.

Proof by contradiction: Assume not. The $Q_f$-values before the first step of any forward
phase are consistent according to Theorem 30. The repeated execution of a forward
step implements an admissible Q-learning algorithm. According to Theorem 29, the
agent reaches a state $s \in G$ eventually, and then sets $done(s,a) := true$ for all $a \in A(s)$
in line 2. Afterwards, $done(s) = true$ and the next step executed is a backward step.
This terminates the forward phase, which is a contradiction.


Theorem 32 For all $s \in S$ and $a \in A(s)$, it holds that: once $done(s,a) = true$, it
remains true after every step of the agent and $Q_f(s,a)$ remains unchanged.

Proof by contradiction: Assume not. Then $done(s,a) = true$ before some step, but
after the step either the value of $done(s,a)$ or $Q_f(s,a)$ has changed. The only line that
can set $done(s,a) := false$ or change the value of $Q_f(s,a)$ is line 6 when the current
state equals $s$. It is only executed if $done(s) = false$ when line 3 is executed. This
implies that $done(s,a') = false$ for all $a' \in A(s)$ with $Q_f(s,a') = \max_{a'' \in A(s)} Q_f(s,a'')$.
But then $done(s,a) = false$ for the action $a$ selected in line 4, which is a contradiction.


Theorem 33 For all $s \in S$, it holds that: once $done(s) = true$, it remains true after
every step of the agent and $U_f(s)$ remains unchanged.

Proof by contradiction: Assume not. Then $done(s) = true$ before some step, but after
the step either the value of $done(s)$ or $U_f(s)$ has changed. This is only possible if a
value of $done(s,a)$ or $Q_f(s,a)$ for an $a \in A(s)$ has changed. The only line that can
change one of these values is line 6 when the current state equals $s$, but it cannot be
executed, since $done(s) = true$ when line 3 is executed. This is a contradiction.


Theorem 34 Every forward phase $[t^a, t^b]$ increases the cardinality of the set $D :=
\{(s,a) : done(s,a) = true \lor done(s) = true, s \in S, a \in A(s)\}$ by at least one.

Proof: $done^{t^b}(s^{t^b}) = true$ per definition of a forward phase. We distinguish two cases:

• $done^{t^b - 1}(s^{t^b}) = false$:

$A(s^{t^b}) \ne \emptyset$, since the state space is strongly connected. $done^{t^b - 1}(s^{t^b}) = false$
implies that there exists an $a \in A(s^{t^b})$ with $done^{t^b - 1}(s^{t^b}, a) = false$. Thus,
$(s^{t^b}, a) \notin D^{t^b - 1}$. However, $(s^{t^b}, a) \in D^{t^b}$, since $done^{t^b}(s^{t^b}) = true$.

• $done^{t^b - 1}(s^{t^b}) = true$:

If a forward step is executed at step $t^b - 1 \ge t^a$, then $done^{t^b - 1}(s^{t^b - 1}) =
false$ (otherwise a backward step would be executed, which would be a contradiction).
Thus, $done^{t^b - 1}(s^{t^b - 1}, a^{t^b - 1}) = false$ (otherwise $done^{t^b - 1}(s^{t^b - 1}) =
\exists a \in A(s^{t^b - 1}) \, (Q_f^{t^b - 1}(s^{t^b - 1}, a) = \max_{a' \in A(s^{t^b - 1})} Q_f^{t^b - 1}(s^{t^b - 1}, a') \land done^{t^b - 1}(s^{t^b - 1}, a)) = true$,
since $Q_f^{t^b - 1}(s^{t^b - 1}, a^{t^b - 1}) = \max_{a' \in A(s^{t^b - 1})} Q_f^{t^b - 1}(s^{t^b - 1}, a')$ and $done^{t^b - 1}(s^{t^b - 1}, a^{t^b - 1}) =
true$, which would be a contradiction). Thus, $(s^{t^b - 1}, a^{t^b - 1}) \notin D^{t^b - 1}$. However,
$(s^{t^b - 1}, a^{t^b - 1}) \in D^{t^b}$, since $done^{t^b}(s^{t^b - 1}, a^{t^b - 1}) = done^{t^b - 1}(succ(s^{t^b - 1}, a^{t^b - 1})) =
done^{t^b - 1}(s^{t^b}) = true$.


Theorem 35 For all $s \in S$, it holds that: if $done(s) = true$, then $U_f(s) = -gd(s) =
U^{opt}(s)$.

Proof by induction on $t$: Initially, $done(s) = false$ for all $s \in S$ and the theorem holds
trivially. We distinguish two cases:

• $gd(s) = 0$, i.e. $s \in G$:

Initially, $U_f(s) = \max_{a \in A(s)} Q_f(s,a) = \max_{a \in A(s)} 0 = 0 = -gd(s) \overset{\text{Theorem 19}}{=}
U^{opt}(s)$. The only line that can change a $Q_f(s,a)$ value for $a \in A(s)$ is line 6
when the current state equals $s$, but it cannot be executed, since $done(s,a)$ is
set to true for all $a \in A(s)$ in line 2 and thus $done(s) = true$ when line 3 is
executed. Thus, the $Q_f(s,a)$ values cannot be changed for all $a \in A(s)$, and
$U_f(s) = \max_{a \in A(s)} Q_f(s,a) = \max_{a \in A(s)} 0 = 0 = -gd(s) = U^{opt}(s)$ continues to
hold after every step of the agent (no matter what the value of $done(s)$ is).

• $gd(s) \ne 0$, i.e. $s \in S \setminus G$:

If $done(s)$ never becomes true, the theorem holds trivially for $s \in S$. Otherwise,
let $t := \arg\min_{t'' \in \mathbb{N}_0} (done^{t''}(s) = true)$ and assume that the theorem
holds for all steps smaller than $t$. Because $done^t(s) = true$, there exists
an $a \in A(s)$ with $Q_f^t(s,a) = \max_{a' \in A(s)} Q_f^t(s,a')$ and $done^t(s,a) = true$.
Since initially $done(s,a) = false$, there exists a step $t' \in \mathbb{N}_0$ with $t' < t$
and $done^{t'}(s,a) = false$, but $done^{t'+1}(s,a) = true$. Line 6 was executed
at step $t'$, since this is the only way to set $done(s,a) := true$ for a state
$s \notin G$. Then, $done^{t'+1}(s,a) = done^{t'}(succ(s,a)) = true$ and $Q_f^{t'+1}(s,a) =
-1 + U_f^{t'}(succ(s,a))$.

$$-1 - gd(succ(s,a)) \le \max_{a' \in A(s)} (-1 - gd(succ(s,a'))) = -(1 + \min_{a' \in A(s)} gd(succ(s,a'))) = -gd(s)
\overset{\text{Theorem 16}}{\le} U_f^t(s) = \max_{a' \in A(s)} Q_f^t(s,a')$$
$$= Q_f^t(s,a) \le Q_f^{t'+1}(s,a) = -1 + U_f^{t'}(succ(s,a)) \overset{\text{assumption}}{=} -1 - gd(succ(s,a)),$$

since the $Q_f$-values are consistent at step $t$ according to Theorem 30. Thus,
equality holds and $U_f^t(s) = -gd(s) \overset{\text{Theorem 19}}{=} U^{opt}(s)$. Since $done^t(s) = true$, $U_f(s)$
remains unchanged after every step of the agent and $U_f(s) = -gd(s) = U^{opt}(s)$
continues to hold according to Theorem 33.


Theorem 36 For all $s \in S \setminus G$ and $a \in A(s)$, it holds that: if $done(s,a) = true$, then
$Q_f(s,a) = -1 - gd(succ(s,a))$.

Proof: Initially, $done(s,a) = false$. If $done(s,a)$ never becomes true, the theorem holds
trivially for $s \in S \setminus G$. Otherwise, let $t := \arg\min_{t' \in \mathbb{N}_0} (done^{t'}(s,a) = true)$. Since $s \notin G$,
the only line that can set $done(s,a) := true$ is line 6 when the current state equals $s$.
Then, $done^t(s,a) = done^{t-1}(succ(s,a))$ and $Q_f^t(s,a) = -1 + U_f^{t-1}(succ(s,a))$. Since
$done^{t-1}(succ(s,a)) = done^t(s,a) = true$, Theorem 35 asserts that $U_f^{t-1}(succ(s,a)) =
-gd(succ(s,a))$. It follows that $Q_f^t(s,a) = -1 - gd(succ(s,a))$. Since $done^t(s,a) =
true$, $Q_f(s,a)$ remains unchanged after every step of the agent and $Q_f(s,a) = -1 -
gd(succ(s,a))$ continues to hold according to Theorem 32.


Theorem 37 There is a maximum of $e$ forward phases and $e + 1$ backward phases.

Proof: According to Theorem 34, every forward phase increases the size of the set
$D := \{(s,a) : done(s,a) = true \lor done(s) = true, s \in S, a \in A(s)\}$ by at least one.
Since $|D| \le e$, there can be at most $e$ forward phases. Since forward and backward
phases alternate, there are at most $e + 1$ backward phases.


Theorem 38 All forward phases together execute at most $O(e \times ub(d))$ steps.

Proof: There are at most $e$ forward phases according to Theorem 37. Let there be
the forward phases $[t_i^a, t_i^b]$ for $i \in \{0, 1, \ldots, e'\}$ with $e' < e$. $Q_f^{t_i^b}(s,a) = Q_f^{t_{i+1}^a}(s,a)$
and $U_f^{t_i^b}(s) = U_f^{t_{i+1}^a}(s)$ for all $i \in \{0, 1, \ldots, e' - 1\}$, $s \in S$, and $a \in A(s)$, since
only forward steps can change $Q_f$-values. According to Theorem 31, every forward
phase terminates. Theorem 27 asserts that the forward phase $[t_i^a, t_i^b]$ needs at most
$U_f^{t_i^b}(s^{t_i^b}) - U_f^{t_i^a}(s^{t_i^a}) + 2 \sum_{s \in S} \sum_{a \in A(s)} (Q_f^{t_i^a}(s,a) - Q_f^{t_i^b}(s,a))$ steps to terminate. The
following relationships hold, since the $Q_f$-values are consistent after every step of the agent
according to Theorem 30 and $G \ne \emptyset$: $-d \le -gd(s) \overset{\text{Theorem 16}}{\le} U_f(s) \le 0$ for all
$s \in S$, $-1 - d \le -1 - gd(succ(s,a)) \overset{\text{Theorem 16}}{\le} Q_f(s,a) \le 0$ for all $s \in S \setminus G$ and $a \in A(s)$, and
$-1 - d \le 0 = Q_f(s,a)$ for $s \in G$ and $a \in A(s)$. Thus, all forward phases together
terminate after at most

$$\sum_{i=0}^{e'} \Big(U_f^{t_i^b}(s^{t_i^b}) - U_f^{t_i^a}(s^{t_i^a}) + 2 \sum_{s \in S} \sum_{a \in A(s)} (Q_f^{t_i^a}(s,a) - Q_f^{t_i^b}(s,a))\Big)
= \sum_{i=0}^{e'} \Big(U_f^{t_i^b}(s^{t_i^b}) - U_f^{t_i^a}(s^{t_i^a})\Big) + 2 \sum_{s \in S} \sum_{a \in A(s)} (Q_f^{t_0^a}(s,a) - Q_f^{t_{e'}^b}(s,a))$$
$$\le \sum_{i=0}^{e'} (0 + d) + 2 \sum_{s \in S} \sum_{a \in A(s)} (0 + (d + 1)) \le ed + 2e(d+1) \le 3e(d+1) \le 3e(ub(d) + 1)$$

steps in total, which are $O(e \times ub(d))$ steps.


Theorem 39 The $Q_b$-values are consistent after every step of the agent and are
monotonically decreasing.

Proof by induction on $t$: Define $G_b := \{s \in S \setminus G : done(s) = false\}$. Thus, $G_b$ changes
during execution.

• Initially, $Q_b(s,a) = 0$ for all $s \in S$ and $a \in A(s)$. Thus, the $Q_b$-values are
consistent according to Theorem 12.

• Assume that the $Q_b^t$-values are consistent for the set of goal states $G_b^t$ before an
arbitrary step $t + 1$. Note that if Q-values are consistent for a set of goal states $G$,
then they are also consistent for any subset $G' \subseteq G$. According to Theorem 33,
it holds that $G_b^{t+1} \subseteq G_b^t$. Thus, the $Q_b^t$-values are consistent for the set of goal
states $G_b^{t+1}$ as well.

The only line that can change a $Q_b$-value is line 10. If line 10 is not executed,
then the $Q_b$-values remain unchanged and therefore remain consistent for $G_b^{t+1}$. If
line 10 is executed, then $done^t(s^t) = true$ (otherwise line 3 would have transferred
control to a forward step) and therefore $s^t \notin G_b^t \supseteq G_b^{t+1}$. Then, the $Q_b$-values
remain consistent for $G_b^{t+1}$ after the step and are monotonically decreasing
according to Theorem 21.


Theorem 40 $U_f(s) = -gd(s) = U^{opt}(s)$, $done(s) = true$, and $U_b(s) < 0$ for all $s \in S$
after termination if the bi-directional Q-learning algorithm (version 1) terminates.

Proof: Let $t$ be the time superscript for the final values of the variables. Then,
$U_b^t(s^{t-1}) < -ub(d) \le -d$, since the condition in line 11 is true at step $t$. According
to Theorem 39, the $Q_b^t$-values are consistent. Theorem 17 states that $-d \le
\min_{s' \in S} U_b^t(s') - \max_{s' \in S} U_b^t(s') \le U_b^t(s^{t-1}) - \max_{s' \in S} U_b^t(s') < -d - \max_{s' \in S} U_b^t(s')$.
Thus, $0 = -d + d < -d - \max_{s' \in S} U_b^t(s') + d = -\max_{s' \in S} U_b^t(s')$. Consider an arbitrary
$s \in S$. Since $\max_{s' \in S} U_b^t(s') < 0$, it must hold that $U_b^t(s) < 0$, but initially $U_b(s) = 0$.
Thus, all values $Q_b(s,a)$ for $a \in A(s)$ have changed. Since only line 10 can change
these values, it is executed at least once with the current state equal to $s$. At this
point in time, $done(s) = true$ (otherwise line 10 could not have been executed) and,
according to Theorem 33, $done^t(s) = true$. Then, $U_f^t(s) = -gd(s) = U^{opt}(s)$ according
to Theorem 35.


Theorem 41 It is an optimal policy to select action $\arg\max_{a \in A(s)} U_f(succ(s,a))$ or,
equivalently, $\arg\max_{a \in A(s), done(s,a) = true} Q_f(s,a)$ in state $s \in S \setminus G$ after termination if
the bi-directional Q-learning algorithm (version 1) terminates (“correctness”).

Proof: Let $t$ be the time superscript for the final values of the variables. We prove
both parts of the theorem separately:

• Consider an arbitrary $s \in S \setminus G$ and any $a \in A(s)$ with $U_f^t(succ(s,a)) =
\max_{a' \in A(s)} U_f^t(succ(s,a'))$. Since $U_f^t(s') = -gd(s')$ for all $s' \in S$ according to
Theorem 40, it holds that $gd(succ(s,a)) = \min_{a' \in A(s)} gd(succ(s,a'))$. Thus, it is
optimal to execute $a$ in $s$.

• Now, consider an arbitrary $s \in S \setminus G$ and any $a \in A(s)$ with $done^t(s,a) =
true$ and $Q_f^t(s,a) = \max_{a'' \in A(s), done^t(s,a'') = true} Q_f^t(s,a'')$. Since $done^t(s) =
true$, there exists an $a' \in A(s)$ with $done^t(s,a') = true$ and $Q_f^t(s,a') =
\max_{a'' \in A(s)} Q_f^t(s,a'')$. Thus, $Q_f^t(s,a) = \max_{a'' \in A(s), done^t(s,a'') = true} Q_f^t(s,a'') \ge
Q_f^t(s,a') = \max_{a'' \in A(s)} Q_f^t(s,a'') \ge Q_f^t(s,a''')$ for all $a''' \in A(s)$. Since
$done^t(s,a) = true$, $Q_f^t(s,a) = -1 - gd(succ(s,a))$ according to Theorem 36.
Furthermore, the $Q_f^t$-values are consistent according to Theorem 30 and therefore
$-1 - gd(succ(s,a''')) \le Q_f^t(s,a''')$. Thus, $gd(succ(s,a)) = -1 - Q_f^t(s,a) \le
-1 - Q_f^t(s,a''') \le gd(succ(s,a'''))$ for all $a''' \in A(s)$. Since $gd(succ(s,a)) =
\min_{a'' \in A(s)} gd(succ(s,a''))$, it is optimal to execute $a$ in $s$.
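
The first selection rule of Theorem 41 can be sketched as follows, assuming that the final values satisfy $U_f(s) = -gd(s)$ as stated in Theorem 40. The small state space, the table `U_F`, and the function names are hypothetical.

```python
# Hypothetical state space with goal state 3; goal distances are 0:2, 1:1, 2:3, 3:0.
SUCC = {0: {"a": 1, "b": 2}, 1: {"a": 3, "b": 0}, 2: {"a": 0}, 3: {"stay": 3}}
GOALS = {3}
U_F = {0: -2, 1: -1, 2: -3, 3: 0}   # assumed final values, U_f(s) = -gd(s)

def greedy_policy(succ, goals, u_f):
    """Select argmax_a U_f(succ(s, a)) in every non-goal state."""
    return {
        s: max(acts, key=lambda a: u_f[acts[a]])
        for s, acts in succ.items() if s not in goals
    }

def path_length(succ, goals, policy, s):
    """Number of action executions needed to reach a goal state when following policy."""
    steps = 0
    while s not in goals:
        s = succ[s][policy[s]]
        steps += 1
    return steps

POLICY = greedy_policy(SUCC, GOALS, U_F)
LENGTHS = {s: path_length(SUCC, GOALS, POLICY, s) for s in POLICY}
```

Following the selected actions from any non-goal state takes exactly $gd(s) = -U_f(s)$ action executions in this example, which is what optimality of the policy means here.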


Theorem 42 Every backward phase terminates.

Proof by contradiction: Assume not. Then there must be a state $s$ that is visited infinitely
often. We call the sequence of states between two occurrences of state $s$ a cycle.
Theorem 39 asserts that the $Q_b$-values before the first step of the backward phase are
consistent. The repeated execution of a backward step implements an admissible
Q-learning algorithm. Thus, the $Q_b$-values are monotonically decreasing. Furthermore,
for every cycle that the agent completes, the largest $Q_b$-value of the actions executed
in the cycle decreases by at least one. Eventually, the $Q_b$-values of all actions executed
in the cycle drop below every bound, and so do the $U_b$-values. In particular, they drop
below $-ub(d)$, the condition in line 11 is satisfied, and the backward phase terminates,
which is a contradiction. (For more details, see similar arguments in the context of
RTA*-type search by [17] or [25].)

Theorem 43 All backward phases together execute at most $O(e \times ub(d))$ steps.


Proof: There are at most $e + 1$ backward phases according to Theorem 37. Let there
be the backward phases $[t_i^a, t_i^b]$ for $i \in \{0, 1, \ldots, e'\}$ with $e' < e + 1$, where we set $t_{e'}^b$
to the step before termination (i.e. one smaller than it should be). This simplifies the
notation in the following.

$Q_b^{t_i^b}(s,a) = Q_b^{t_{i+1}^a}(s,a)$ and $U_b^{t_i^b}(s) = U_b^{t_{i+1}^a}(s)$ for all $i \in \{0, 1, \ldots, e' - 1\}$, $s \in S$,
and $a \in A(s)$, since only backward steps can change $Q_b$-values. According to
Theorem 42, every backward phase terminates. Theorem 27 asserts that the backward
phase $[t_i^a, t_i^b]$ needs at most $U_b^{t_i^b}(s^{t_i^b}) - U_b^{t_i^a}(s^{t_i^a}) + 2 \sum_{s \in S} \sum_{a \in A(s)} (Q_b^{t_i^a}(s,a) - Q_b^{t_i^b}(s,a))$
steps to terminate. Note that $-ub(d) \le U_b(s) \le 0$ for all $s \in S$ and all
steps but the final step (otherwise the algorithm would have terminated earlier),
and therefore $-1 - ub(d) \le -1 + U_b(succ(s,a)) \le Q_b(s,a)$ for all $s \in S \setminus G$
and $a \in A(s)$. Thus, all backward phases together terminate after at most

$$\sum_{i=0}^{e'} \Big(U_b^{t_i^b}(s^{t_i^b}) - U_b^{t_i^a}(s^{t_i^a}) + 2 \sum_{s \in S} \sum_{a \in A(s)} (Q_b^{t_i^a}(s,a) - Q_b^{t_i^b}(s,a))\Big)
= \sum_{i=0}^{e'} \Big(U_b^{t_i^b}(s^{t_i^b}) - U_b^{t_i^a}(s^{t_i^a})\Big) + 2 \sum_{s \in S} \sum_{a \in A(s)} (Q_b^{t_0^a}(s,a) - Q_b^{t_{e'}^b}(s,a))$$
$$\le \sum_{i=0}^{e'} (0 + ub(d)) + 2 \sum_{s \in S} \sum_{a \in A(s)} (0 + (ub(d) + 1)) \le (e + 1) \times ub(d) + 2e(ub(d) + 1) \le 3(e + 1)(ub(d) + 1)$$

steps in total (plus the last step), which are $O(e \times ub(d))$ steps.


Theorem 44 The bi-directional Q-learning algorithm (version 1) finds an optimal
policy and terminates after at most $O(e \times ub(d))$ steps.

Proof: According to Theorems 38 and 43, the forward and backward phases together
execute at most $O(e \times ub(d))$ steps. According to Theorem 41, the bi-directional
Q-learning algorithm terminates with an optimal policy.


A.3 Finding Optimal Policies with Bi-Directional Q-Learning (Version 2)

In the following, we consider the bi-directional Q-learning algorithm (version 2) as
stated in Figure 21.

The time superscripts used in this chapter refer to the values of the variables
immediately before the action execution steps of the bi-directional Q-learning algorithm, i.e.
lines 7 and 12, of step $t$.


We define $S^t$ to be the set of states that the agent has already explored, i.e. $S^t := \{s \in
S : \exists t' \le t \, (s^{t'} = s)\}$. Let $A^t(s) \subseteq A(s)$ be the set of actions in $s$ that the agent has not yet
explored, i.e. $A^t(s) := \{a \in A(s) : \neg \exists t' < t \, (s^{t'} = s \land a^{t'} = a)\}$. Note the asymmetries:
First, $S^t$ is the set of explored states, but $A^t(s)$ is a set of unexplored actions. This
asymmetry simplifies the notation. Second, $t' \le t$ in the definition of $S^t$, but $t' < t$ in
the definition of $A^t(s)$. This is so, because a step ends with the execution of an action
(according to the definition of “step”). Thus, the current state of the agent before the
action execution is already explored, since the agent is already in the state. However,
the action selected for execution is not, since it has not been executed yet. Thus, we
say that an action that has not been executed before step $t$, but is executed at step $t$,
was unexplored (before and) at step $t$ and is explored at step $t + 1$.
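
The asymmetry between the two definitions is easy to see on a recorded trajectory. The sketch below assumes a hypothetical trace in which `STATES[t]` is the state of the agent at step `t` and `ACTIONS[t]` is the action executed at step `t` (so `STATES` is one element longer than `ACTIONS`); the function names are illustrative only.

```python
# Hypothetical state space and trajectory.
SUCC = {0: {"r": 1}, 1: {"l": 0, "r": 2}, 2: {"l": 1, "r": 3}, 3: {"stay": 3}}
STATES = [0, 1, 0, 1, 2]
ACTIONS = ["r", "l", "r", "r"]

def explored_states(states, t):
    """States visited at some step t' <= t (the current state counts as explored)."""
    return set(states[: t + 1])

def unexplored_actions(succ, states, actions, s, t):
    """Actions of s that were not executed in s at any step t' < t."""
    executed = {a for s2, a in zip(states[:t], actions[:t]) if s2 == s}
    return set(succ[s]) - executed

# At step 1 the agent is in state 1 and is about to execute "l": state 1 is already
# explored, but "l" still counts as unexplored; it is explored from step 2 on.
S_AT_1 = explored_states(STATES, 1)
A_AT_1 = unexplored_actions(SUCC, STATES, ACTIONS, 1, 1)
A_AT_2 = unexplored_actions(SUCC, STATES, ACTIONS, 1, 2)
```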


Theorem 45 For all $s \in S$, $a \in A(s)$, and steps $t \in \mathbb{N}_0$ (until termination), it holds
that: $Q_f^t(s,a) = 0$ iff $a$ has never been executed in $s$ in a forward step before step $t$, i.e.
for all $t' < t$ it has never been true that $s^{t'} = s$, $a^{t'} = a$, and $done^{t'}(s) = false$.

Proof: Initially, $Q_f^0(s,a) = 0$. Only the execution of $a$ in $s$ in a forward step can change
$Q_f(s,a)$. Thus, if $a$ has never been executed in $s$ in a forward step before step $t$, then
$Q_f^t(s,a) = 0$. However, if $a$ has been executed in $s$ in a forward step at least once, say
at step $t' < t$, then $Q_f^t(s,a) \le Q_f^{t'+1}(s,a) = -1 + U_f^{t'}(succ(s,a)) \le -1 + 0 =
-1 < 0$.


Theorem 46 For all $s \in S$, $a \in A(s)$, and steps $t \in \mathbb{N}_0$ (until termination), it holds
that: $Q_b^t(s,a) = 0$ iff $a$ has never been executed in $s$ in a backward step before step $t$,
i.e. for all $t' < t$ it has never been true that $s^{t'} = s$, $a^{t'} = a$, and $done^{t'}(s) = true$.

Proof: Initially, $Q_b^0(s,a) = 0$. Only the execution of $a$ in $s$ in a backward step can
change $Q_b(s,a)$. Thus, if $a$ has never been executed in $s$ in a backward step before step
$t$, then $Q_b^t(s,a) = 0$. However, if $a$ has been executed in $s$ in a backward step at least
once, say at step $t' < t$, then $Q_b^t(s,a) \overset{\text{Theorem 30}}{\le} Q_b^{t'+1}(s,a) = -1 + U_b^{t'}(succ(s,a)) \overset{\text{Theorem 16}}{\le}
-1 + 0 = -1 < 0$.


Theorem 47 For all $s \in S$ and steps $t \in \mathbb{N}_0$ (until termination), it holds that:
$Q_f^{t+1}(s,a) = Q_b^{t+1}(s,a) = 0$ for all $a \in A(s)$ iff $s$ is unexplored at step $t$, i.e. $s \notin S^t$.

Proof: If $Q_f^{t+1}(s,a) = Q_b^{t+1}(s,a) = 0$ for all $a \in A(s)$, then, according to Theorems 45
and 46, no $a \in A(s)$ has ever been executed in $s$ before step $t + 1$ (neither in a
forward step nor in a backward step). Thus, $s \notin S^t$, because otherwise the agent
would have been in state $s$ at step $t$ or before and would have had to execute an action in
it, which is a contradiction. Likewise, if $s \notin S^t$, then the agent has never been in
$s$ at step $t$ or before, and thus had no chance to execute an action in it. Therefore,
$Q_f^{t+1}(s,a) = Q_b^{t+1}(s,a) = 0$ for all $a \in A(s)$ according to Theorems 45 and 46.


Theorem 48 For all steps $t \in \mathbb{N}_0$ (until termination), $memory^t = |\{(s,a) : s \in S^t \cap G, a \in A^t(s)\}| + |\{s \in S^t \setminus G : \neg done^t(s)\}|$.

Proof by induction on t:

• Before the execution of the algorithm (i.e. at "step -1"), memory = 0 and $S = \emptyset$. Thus, $memory = 0 = |\{(s,a) : s \in S \cap G, a \in A(s)\}| + |\{s \in S \setminus G : \neg done(s)\}|$.

•  Assume  that  the  equation  holds  for  step  t. 

The only lines that can change the value of memory are line 3, line 9, and line 14. Likewise, there are only three ways in which $|\{(s,a) : s \in S \cap G, a \in A(s)\}| + |\{s \in S \setminus G : \neg done(s)\}|$ can change between steps t and t + 1: S can change, done(s) for one or more $s \in S^t \setminus G$ can change, or A(s) for one or more $s \in S^t \cap G$ can change. We show that three equivalence relationships hold:

- The body of the condition on line 3 is executed iff S has changed between steps t and t + 1.

If the body of the condition on line 3 is executed, then $Q_f^{t+1}(s^{t+1},a) = Q_b^{t+1}(s^{t+1},a) = 0$ for all $a \in A(s^{t+1})$, i.e. $s^{t+1} \notin S^t$ according to Theorem 47. Since $s^{t+1} \in S^{t+1}$, S has changed between steps t and t + 1. After the execution of line 3, memory is increased by $|A(s^{t+1})|$ if $s^{t+1} \in G$, otherwise by one.

Assume now that S has changed between steps t and t + 1, i.e. $s^{t+1} \notin S^t$ and $S^{t+1} = S^t \cup \{s^{t+1}\}$. Thus, no action has yet been executed in $s^{t+1}$, i.e. $A^{t+1}(s^{t+1}) = A(s^{t+1})$, and therefore $Q_f^{t+1}(s^{t+1},a) = Q_b^{t+1}(s^{t+1},a) = 0$ for all $a \in A(s^{t+1})$ according to Theorem 47. Thus, the body of the condition on line 3 is executed. If $s^{t+1} \in G$, then $|\{(s,a) : s \in S \cap G, a \in A(s)\}| + |\{s \in S \setminus G : \neg done(s)\}|$ increases by $|A(s^{t+1})|$, since $|A^{t+1}(s^{t+1})| = |A(s^{t+1})|$. If $s^{t+1} \notin G$, then $|\{(s,a) : s \in S \cap G, a \in A(s)\}| + |\{s \in S \setminus G : \neg done(s)\}|$ increases by one, since $done^{t+1}(s^{t+1}) = false$: initially $done(s^{t+1}) = false$, only line 8 can change $done(s^{t+1})$ (because $s^{t+1} \notin G$), and line 8 cannot have been executed (because $s^{t+1} \notin S^t$).

- The body of the condition on line 9 is executed iff done(s) for one or more $s \in S^t \setminus G$ has changed between steps t and t + 1.

If the body of the condition on line 9 is executed, then $done(s^t)$ has changed. $s^t \in S^t \setminus G$, because otherwise line 2 would set $done(s^t) := true$ and line 5 would give control to a backward step, which is a contradiction. Thus, done(s) for one or more $s \in S^t \setminus G$ has changed between steps t and t + 1. After the execution of line 9, memory is decreased by one.

Assume now that done(s) for one or more $s \in S^t \setminus G$ has changed between steps t and t + 1. Only line 8 can change done(s) for $s \notin G$. Thus, $done(s^t)$ has changed, and the body of the condition on line 9 is executed. At most one done(s) for $s \in S^t \setminus G$ can change between steps t and t + 1, because line 8 is the only line that can change such a value. According to Theorem 33, done(s) can only change from false to true for all $s \in S$. Thus, $|\{s \in S^t \setminus G : \neg done(s)\}|$ is decreased by one, since $done(s^t)$ has changed from false to true, but all other done(s) stayed the same.

- The body of the condition on line 14 is executed iff A(s) for one or more $s \in S^t \cap G$ has changed between steps t and t + 1.

If the body of the condition on line 14 is executed, then $s^t \in S^t \cap G$ and $Q_b(s^t,a^t)$ has changed from zero to non-zero. $a^t \notin A^{t+1}(s^t)$, since $a^t$ was executed in $s^t$ in a backward step at step t. According to Theorem 30, the Q-values are admissible, which implies that $Q_f(s,a) = 0$ for all $s \in G$ and $a \in A(s)$. In particular, $Q_f^t(s^t,a^t) = 0$. Then, $Q_b^t(s^t,a^t) = 0$ implies $a^t \in A^t(s^t)$ according to Theorems 45 and 46. Since $a^t \in A^t(s^t)$, but $a^t \notin A^{t+1}(s^t)$, A(s) for one or more $s \in S^t \cap G$ has changed between steps t and t + 1. After the execution of line 14, memory is decreased by one.

Assume now that A(s) for one or more $s \in S^t \cap G$ has changed between steps t and t + 1. Since only line 12 can execute actions in a goal state s (line 2 sets done(s) := true and thus line 5 always gives control to a backward step), it follows that $a^t \in A^t(s^t)$ and $A^{t+1}(s^t) = A^t(s^t) \setminus \{a^t\}$; all other A(s) stay the same. Since $a^t \in A^t(s^t)$, $Q_b^t(s^t,a^t) = 0$ according to Theorem 46. Since $a^t$ was executed in $s^t$ in a backward step at step t, $Q_b^{t+1}(s^t,a^t) \ne 0$ according to Theorem 46. Then, $Q_b(s^t,a^t)$ changed from zero to non-zero and the body of the condition on line 14 is executed. Also, $|\{(s,a) : s \in S^t \cap G, a \in A(s)\}|$ is decreased by one.

Since the equation stated in the theorem holds for step t according to the assumption and we have shown that between steps t and t + 1 the left-hand side of the equation changes as much as its right-hand side, it follows that the equation of the theorem holds for step t + 1 as well.


Theorem 49 $0 \le memory^t \le e$ for all steps $t \in \mathbb{N}_0$ (until termination).


Proof: According to Theorem 48, it holds that $memory^t = |\{(s,a) : s \in S^t \cap G, a \in A^t(s)\}| + |\{s \in S^t \setminus G : \neg done^t(s)\}|$. $0 \le |\{(s,a) : s \in S^t \cap G, a \in A^t(s)\}| + |\{s \in S^t \setminus G : \neg done^t(s)\}| \le \sum_{s \in G} |A(s)| + |S \setminus G| \le \sum_{s \in S} |A(s)| = e$, since the state space is strongly connected and therefore $|A(s)| \ge 1$ for all $s \in S$, i.e. $|S \setminus G| \le \sum_{s \in S \setminus G} |A(s)|$.

Theorem 50 $U_f(s) = -gd(s) = U^{opt}(s)$ and done(s) = true for all $s \in S$ after termination if the bi-directional Q-learning algorithm (version 2) terminates. Furthermore, it is an optimal policy to select action $\mathrm{argmax}_{a \in A(s)} U_f(\mathrm{succ}(s,a))$ or, equivalently, $\mathrm{argmax}_{a \in A(s),\, done(s,a) = true} Q_f(s,a)$ in state $s \in S \setminus G$ after termination ("correctness").

Proof by contradiction: Let t be the time superscript for the final values of the variables. Assume that it does not hold that $done^t(s) = true$ for all $s \in S$. Then, there exists an $s \in S$ with $done^t(s) = false$. We distinguish two cases:

• $s \in S^t$:

$s \in S^t$ and $done^t(s) = false$ implies according to Theorem 48 that $memory^t \ge 1$. However, memory = 0 upon termination, which is a contradiction.

• $s \notin S^t$:

s can be reached from every state in $S^t \ne \emptyset$, since the state space is strongly connected. Thus, $A^t(s') \ne \emptyset$ for at least one $s' \in S^t$. Assume $a \in A^t(s')$. If $s' \in G$, then $memory^t \ge 1$ according to Theorem 48, since $|A^t(s')| > 0$. Assume $s' \notin G$. Then, $-gd(s') < 0 = Q_f^t(s',a) \le \max_{a' \in A(s')} Q_f^t(s',a') = U_f^t(s')$, where the equality holds by Theorem 45. This implies according to Theorem 35 that $done^t(s') = false$. Thus, $memory^t \ge 1$ according to Theorem 48, since $s' \in S^t$, $s' \notin G$, and $done^t(s') = false$. In both cases, $memory^t \ge 1$. However, memory = 0 upon termination, which is a contradiction.

Thus, $done^t(s) = true$ and, according to Theorem 35, $U_f(s) = -gd(s) = U^{opt}(s)$ for all $s \in S$. The proof of Theorem 41 relies only on this fact. Thus, according to Theorem 41, it is an optimal policy to select action $\mathrm{argmax}_{a \in A(s)} U_f(\mathrm{succ}(s,a))$ or, equivalently, $\mathrm{argmax}_{a \in A(s),\, done(s,a) = true} Q_f(s,a)$ in state $s \in S \setminus G$.


Theorem 51 The bi-directional Q-learning algorithm (version 2) terminates at the earliest time t at which $U_b^t(s^{t-1}) \le -d$ or before.

Proof: Assume that $U_b(s) \le -d$ after the value updating in a backward step. Note that this is exactly the situation in which the bi-directional Q-learning algorithm (version 1) terminates. Two things follow. First, according to Theorem 40, done(s') = true for all $s' \in S$. Second, also according to Theorem 40, $U_b(s') < 0$ for all $s' \in S$. This implies that $Q_b(s',a) \le \max_{a' \in A(s')} Q_b(s',a') = U_b(s') < 0$ for all $s' \in S$ and $a \in A(s')$. Thus, all actions have been executed at least once according to Theorem 46. Put together, it follows that done(s') = true and $A(s') = \emptyset$ for all $s' \in S$. Once done(s') = true, it remains true according to Theorem 33. Similarly, A(s') stays $\emptyset$ once it is empty. Thus, the next time line 4 is reached, memory = 0 according to Theorem 48, and the algorithm terminates.

Theorem 52 The bi-directional Q-learning algorithm (version 2) finds an optimal policy and terminates after at most O(ed) steps.

Proof: Version 1 and version 2 of the bi-directional Q-learning algorithm differ only in their termination criteria. According to Theorem 44, the bi-directional Q-learning algorithm (version 1) terminates after at most O(ed) steps if ub(d) = d. Let t be the time superscript for the values of the variables upon termination of the bi-directional Q-learning algorithm (version 1). Then, $U_b^t(s^{t-1}) \le -ub(d) \le -d$. Then, the bi-directional Q-learning algorithm (version 2) terminates no later than step t according to Theorem 51, i.e. after at most O(ed) steps. According to Theorem 50, it terminates with an optimal policy.


A.4 Value-Iteration

All theorems about the value-iteration algorithm or the bi-directional value-iteration algorithm can be proved analogously to their Q-learning counterparts. Therefore, we omit the proofs here. Some of the proofs can be found in [14], which contains proofs for the theorems about reaching a goal state with an admissible value-iteration algorithm if the initial Q-values are consistent, and about finding optimal policies with a version of the bi-directional value-iteration algorithm that is slightly different from the one presented in this report. With only minor changes, the proofs also apply to the more general case of reaching a goal state with an admissible value-iteration algorithm if the initial Q-values are admissible.


A.5 Reaching a Goal State with Random Walks in Eulerian
State Spaces

Consider a random walk that starts in $s_{start} \in S$ and continues forever. Let P denote the steady state probabilities of the random walk. Formally,

$$P_s := \lim_{t \to \infty} \frac{E(\text{number of times the random walk visits } s \text{ in the first } t \text{ steps})}{t} \quad \text{for all } s \in S,$$

i.e. $P_s$ is the average probability in the long run that the random walk is in s. $P_s$ is well-defined and independent of $s_{start}$, since the state space is strongly connected. It
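holds that

$$\sum_{s \in S} P_s = 1 \qquad (5)$$

$$P_s = \sum_{s' \in S} \; \sum_{a \in A(s') :\, \mathrm{succ}(s',a) = s} \frac{P_{s'}}{|A(s')|} \quad \text{for all } s \in S. \qquad (6)$$

For all $s, s' \in S$, let T(s, s') denote the expected number of steps that the random walk needs to enter s' for the first time if $s_{start} = s$.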


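Theorem 53 (Aleliunas et al. [1]) $\frac{P_s}{|A(s)|} = \frac{P_{s'}}{|A(s')|}$ for all $s, s' \in S$ if the state space is Eulerian.


Proof by contradiction: Assume not. Then, there exist $s, s' \in S$ and $a \in A(s')$ with $\mathrm{succ}(s',a) = s$ such that both $\frac{P_s}{|A(s)|} = \max_{s'' \in S} \frac{P_{s''}}{|A(s'')|}$ and $\frac{P_{s'}}{|A(s')|} < \max_{s'' \in S} \frac{P_{s''}}{|A(s'')|}$. (Such a pair exists because the state space is strongly connected: if the ratios were not all equal, some state attaining the maximum would have a predecessor that does not.)

Rewriting Equation 6 yields $\frac{P_s}{|A(s)|} = \frac{1}{|A(s)|} \sum_{s'' \in S} \sum_{a \in A(s'') :\, \mathrm{succ}(s'',a) = s} \frac{P_{s''}}{|A(s'')|}$. Since the state space is Eulerian, it holds that $|A(s)| = |\{(s'',a) : \mathrm{succ}(s'',a) = s \wedge s'' \in S \wedge a \in A(s'')\}|$. Thus, $\frac{P_s}{|A(s)|}$ is the average over $\frac{P_{s''}}{|A(s'')|}$ of all predecessors s'' of s. If one averages over several numbers all of which are smaller than or equal to a number i and one of the numbers is strictly smaller than i, then the average is strictly smaller than i, too. Since $\frac{P_{s'}}{|A(s')|} < \max_{s'' \in S} \frac{P_{s''}}{|A(s'')|}$ for the predecessor s' of s, it follows that $\frac{P_s}{|A(s)|} < \max_{s'' \in S} \frac{P_{s''}}{|A(s'')|}$, which is a contradiction.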


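Theorem 54 (Aleliunas et al. [1]) $P_s = \frac{|A(s)|}{e}$ for all $s \in S$ if the state space is Eulerian.


Proof: $1 = \sum_{s' \in S} P_{s'} = \sum_{s' \in S} |A(s')| \frac{P_{s'}}{|A(s')|} = \frac{P_s}{|A(s)|} \sum_{s' \in S} |A(s')| = \frac{P_s}{|A(s)|} e$ for all $s \in S$, where the first equality is Equation 5 and the third holds by Theorem 53. Thus, $P_s = \frac{|A(s)|}{e}$.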
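
The claim of Theorem 54 can be checked by hand on a small example. The following Python sketch is illustrative only: the four-state state space `succ` is a hypothetical example (not one of the report's figures), and `is_eulerian` and `inflow` are helper names introduced here. It verifies that $P_s = |A(s)|/e$ sums to one and that each $P_s$ equals the probability mass flowing into s from its predecessors when the random walk picks actions uniformly.

```python
from fractions import Fraction

# Hypothetical Eulerian state space: succ[s] lists succ(s, a) for every a in A(s).
succ = {0: [1], 1: [2, 3], 2: [0], 3: [1]}


def is_eulerian(succ):
    """True if every state has as many incoming as outgoing actions."""
    indeg = {s: 0 for s in succ}
    for s in succ:
        for s2 in succ[s]:
            indeg[s2] += 1
    return all(indeg[s] == len(succ[s]) for s in succ)


# e is the total number of state-action pairs.
e = sum(len(actions) for actions in succ.values())

# Candidate steady-state probabilities from Theorem 54.
P = {s: Fraction(len(succ[s]), e) for s in succ}


def inflow(s):
    """Probability mass entering s in one step of the uniform random walk."""
    return sum(P[s1] / len(succ[s1])
               for s1 in succ for s2 in succ[s1] if s2 == s)


balanced = all(inflow(s) == P[s] for s in succ)
```

Exact fractions are used so that the balance condition can be compared with `==` rather than with a floating-point tolerance.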


Theorem 55 (Aleliunas et al. [1]) $T(s, \mathrm{succ}(s,a)) \le e$ for all $s \in S$ and $a \in A(s)$ if the state space is Eulerian.

Proof: $P_s = \frac{|A(s)|}{e}$ according to Theorem 54. Thus, the average probability that the random walk is in s in the long run equals $\frac{|A(s)|}{e}$. The probability that it executes a when it is in s is $\frac{1}{|A(s)|}$. The joint probability that it is in s in the long run and executes a is $\frac{|A(s)|}{e} \cdot \frac{1}{|A(s)|} = \frac{1}{e}$. The expected number of steps between two occurrences of the agent being in s and executing a is the reciprocal of its probability in the long run, i.e. it is e. It follows that $T(s, \mathrm{succ}(s,a)) \le e$.


Theorem 56 $T(s, s') \le ed$ for all $s, s' \in S$ if the state space is Eulerian.

Proof: There is a path from s to s' of length at most d. The expected number of steps that a random walk needs to traverse this path is the sum of the expected number of steps to go from every state on the path to its successor, each of which is at most e according to Theorem 55. Thus, $T(s, s') \le ed$.
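
As a sanity check of the bounds in Theorems 55 and 56, the expected hitting times of a uniform random walk can be computed exactly on a small example. The sketch below is an assumption-laden illustration: the state space `succ` is a hypothetical Eulerian example, d is taken to be the largest shortest-path distance between any two states, and `distance` and `hitting_times` are helper names introduced here. It solves $T(s) = 1 + \frac{1}{|A(s)|} \sum_{a \in A(s)} T(\mathrm{succ}(s,a))$ with $T(target) = 0$ by Gauss-Jordan elimination over exact fractions.

```python
from collections import deque
from fractions import Fraction

succ = {0: [1], 1: [2, 3], 2: [0], 3: [1]}  # hypothetical Eulerian example
states = sorted(succ)
e = sum(len(actions) for actions in succ.values())


def distance(src, dst):
    """Length of a shortest path from src to dst (breadth-first search)."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        s = queue.popleft()
        for s2 in succ[s]:
            if s2 not in dist:
                dist[s2] = dist[s] + 1
                queue.append(s2)
    return dist[dst]


def hitting_times(target):
    """Expected number of steps of a uniform random walk to first enter target."""
    unknown = [s for s in states if s != target]
    idx = {s: i for i, s in enumerate(unknown)}
    n = len(unknown)
    # Augmented rows encode T(s) - sum_a T(succ(s, a)) / |A(s)| = 1.
    rows = [[Fraction(0)] * n + [Fraction(1)] for _ in unknown]
    for s in unknown:
        rows[idx[s]][idx[s]] += 1
        for s2 in succ[s]:
            if s2 != target:
                rows[idx[s]][idx[s2]] -= Fraction(1, len(succ[s]))
    # Gauss-Jordan elimination with exact arithmetic.
    for col in range(n):
        pivot = next(r for r in range(col, n) if rows[r][col] != 0)
        rows[col], rows[pivot] = rows[pivot], rows[col]
        rows[col] = [x / rows[col][col] for x in rows[col]]
        for r in range(n):
            if r != col and rows[r][col] != 0:
                factor = rows[r][col]
                rows[r] = [x - factor * y for x, y in zip(rows[r], rows[col])]
    T = {s: rows[idx[s]][n] for s in unknown}
    T[target] = Fraction(0)
    return T


d = max(distance(s, s2) for s in states for s2 in states)
```

In this example e = 5 and d = 3, and the largest hitting time is well below ed = 15, which is consistent with the theorem giving an upper bound rather than an exact value.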


Theorem 57 A zero-initialized Q-learning algorithm with goal-reward representation reaches a goal state and terminates after at most O(ed) steps on average if the state space is Eulerian.


Proof: A zero-initialized Q-learning algorithm with goal-reward representation performs a random walk until it reaches a goal state for the first time. The expected number of steps needed to reach a goal state and terminate is $x_{s_{start}} \le \min_{s \in G} T(s_{start}, s)$. If the state space is Eulerian, then $\min_{s \in G} T(s_{start}, s) \le \min_{s \in G} ed = ed$ according to Theorem 56. Thus, the average-case complexity is at most $O(x_{s_{start}}) \le O(ed)$.


B  Lower  Bounds 

We prove all lower bounds on the worst-case complexity of reaching a goal state by giving an example of a trace (i.e. state sequence) that achieves or tops the bound. The traces are specified as pseudo-code that, when executed, prints the state sequence. The for-to-loops have a default increment of one and the for-downto-loops have a default increment of minus one, unless specified otherwise with a step-clause. The statements that belong to the body of a loop are indicated by indentation. They are not executed if the range of the loop variable is empty. For example, the pseudo-code given in Chapter B.1 specifies the trace 1234...n, which has length n - 1 (i.e. needs n - 1 steps to execute).


B.1 Reaching a Goal State with any Search Algorithm

Figure 18 shows a state space for which every search algorithm needs at least n - 1 steps to reach the goal state (for n ≥ 1):

for i := 1 to n
    print i
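
The loop conventions described above can be mirrored with a pair of small helpers. The following Python sketch is illustrative only: `for_to`, `for_downto`, and `trace_b1` are hypothetical names, and it assumes that a step-clause on a for-downto-loop gives the magnitude of the decrement.

```python
def for_to(lo, hi, step=1):
    """Loop values lo, lo+step, ... up to and including hi; empty if lo > hi."""
    return range(lo, hi + 1, step)


def for_downto(hi, lo, step=1):
    """Loop values hi, hi-step, ... down to and including lo; empty if hi < lo."""
    return range(hi, lo - 1, -step)


def trace_b1(n):
    """State sequence printed by the pseudo-code of Chapter B.1."""
    return [i for i in for_to(1, n)]


trace = trace_b1(5)
steps = len(trace) - 1  # a trace that visits k states takes k - 1 steps
```

Counting steps as one less than the number of printed states is the convention used for all trace lengths in this appendix.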

B.2 Reaching a Goal State with Uninformed On-Line Search
Algorithms

The  following  proofs  of  lower  bounds  that  hold  for  all  uninformed  on-line  search  al¬ 
gorithms  utilize  that  the  agent  cannot  distinguish  the  goodness  of  unexplored  actions 
in  the  current  state.  Similarly,  the  proofs  of  lower  bounds  that  hold  for  all  search 
algorithms  that  have  to  enter  a  state  at  least  once  before  they  know  the  successor 
states  utilize  that  the  agent  cannot  distinguish  among  unexplored  successor  states  of 
the  current  state.  In  both  cases,  we  supply  a  tie  breaking  rule  that,  when  followed  by 
the  agent,  achieves  or  tops  the  lower  bound.  For  all  domains  in  this  report,  this  rule  is 
to  prefer  actions  that  lead  to  states  with  smaller  numbers.  Every  uninformed  on-line 
search  algorithm  can  traverse  a  supersequence  of  the  traces  given  by  us. 

Figure 11 shows a domain for which every uninformed on-line search algorithm can need at least (e - n + 1)(n - 1) steps to reach the goal state (for n ≥ 2 and e ≥ n):

for i := 1 to e-n+1
    for j := 1 to n-1
        print j
print n


Figure 14 shows a domain for which every uninformed on-line search algorithm can need at least (e - n)(n - 2) + 1 steps to reach the goal state (for n ≥ 3 and e ≥ n + 1):

print 1
for i := 1 to e-n
    for j := 2 to n-1
        print j
print n

An off-line search algorithm that knows the topology of the state space can reach the goal state in one step:

print 1
print n


Figure 15 shows a domain for which every uninformed on-line search algorithm can need at least $\frac{1}{6}n^3 - \frac{1}{6}n$ steps to reach the goal state (for n ≥ 1):

for i := 1 to n-1
    print i
    for j := 1 to i-1
        for k := j to i
            print k
print n

This fact and Theorem 2 together imply that the complexity of an admissible, zero-initialized Q-learning algorithm in the domain shown in Figure 15 is tight at $O(n^3)$.
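
The stated trace length can be confirmed by generating the trace. The Python sketch below is an illustration under the loop conventions of this appendix (inclusive bounds, empty loops skipped); the function names `trace_fig15` and `steps_fig15` are hypothetical.

```python
def trace_fig15(n):
    """State sequence of the Figure 15 trace."""
    out = []
    for i in range(1, n):                  # for i := 1 to n-1
        out.append(i)
        for j in range(1, i):              # for j := 1 to i-1
            for k in range(j, i + 1):      # for k := j to i
                out.append(k)
    out.append(n)                          # print n
    return out


def steps_fig15(n):
    """Number of steps, i.e. number of printed states minus one."""
    return len(trace_fig15(n)) - 1


# The step count should equal (n^3 - n) / 6 for every n >= 1.
matches = all(6 * steps_fig15(n) == n**3 - n for n in range(1, 30))
```

For n = 3 the trace is 1 2 1 2 3, which takes 4 steps, in agreement with $(27 - 3)/6 = 4$.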


Figure 19 shows a domain for which every uninformed on-line search algorithm can need at least e + n - 4 steps to reach the goal state (for n ≥ 2 and e ≥ 2n - 2):

for i := 1 to (e-2n+4)/2
    print 2
    print 1
for i := 2 to n-2
    print i
    print i+1
    print i
print n-1
print n


Figure 23 shows a domain for which every algorithm that has to enter a state at least once before it knows the successor states can need at least $\frac{1}{2}n^2 - \frac{1}{2}n$ steps to reach the goal state (for n ≥ 1):

for i := 1 to n-1
    for j := i downto 1
        print j
print n

An off-line search algorithm that knows the topology of the state space can reach the goal state in one step:

print 1
print n


Figure 25 shows a domain for which every algorithm that has to enter a state at least once before it knows the successor states can need at least $\frac{1}{4}n^2 - 1$ steps to reach the goal state (for even n ≥ 1):

for i := 1 to n-3 step 2
    for j := 1 to i step 2
        print j
    for j := i+1 downto 2 step 2
        print j
for i := 1 to n-1 step 2
    print i

B.3  Reaching  a  Goal  State  with  Q-Learning  and  Value- 
Iteration 

Because  an  admissible,  zero-initialized  Q-learning  algorithm  is  an  uninformed  on-line 
search  algorithm,  the  lower  bounds  proved  above  for  uninformed  on-line  search  algo¬ 
rithms  also  hold  for  Q-learning.  Similarly,  since  an  admissible,  zero-initialized  value- 
iteration  algorithm  is  an  algorithm  that  has  to  enter  a  state  at  least  once  before  it 
knows  the  successor  states,  the  lower  bounds  proved  above  for  these  algorithms  also 
hold  for  value-iteration.  In  the  following,  we  provide  traces  for  the  domains  used  in  the 
main  text  that  are  not  covered  by  the  above  proofs  or  for  which  we  can  prove  a  larger 
lower  bound.  (See  chapter  B.4  for  an  analysis  of  the  domain  shown  in  Figure  12.) 


Figure 17 shows a domain for which an admissible, zero-initialized Q-learning algorithm can need at least $\frac{1}{16}n^3 + \frac{3}{8}n^2 - \frac{3}{16}n - \frac{1}{4}$ steps to reach the goal state (for n ≥ 1 with n mod 4 = 1):

for i := n-1 downto (3n+1)/4
    for j := (n+1)/2 downto 1
        print j
    for j := 2 to (n+1)/2
        print j
        for k := 1 to j-2
            print k
            print j
    for j := (n+3)/2 to i-1
        print j
    for j := i downto (n+3)/2
        print j
for j := (n+1)/2 downto 1
    print j
for j := 2 to (n+1)/2
    print j
    for k := 1 to j-2
        print k
        print j
for i := (n+3)/2 to n
    print i
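
Because the nesting of this trace is easy to misread, it may help to regenerate it and count the steps. The Python sketch below is an illustration only: `trace_fig17` and its inner helper `sweep` are hypothetical names, and the loop nesting is the one that reproduces the stated step count.

```python
def trace_fig17(n):
    """State sequence of the Figure 17 trace (requires n mod 4 = 1)."""
    if n % 4 != 1:
        raise ValueError("the trace is stated for n mod 4 = 1")
    half = (n + 1) // 2
    out = []

    def sweep():
        # Walk down from (n+1)/2 to state 1, then climb back up; while
        # climbing, state j is revisited after each excursion to 1..j-2.
        for j in range(half, 0, -1):           # for j := (n+1)/2 downto 1
            out.append(j)
        for j in range(2, half + 1):           # for j := 2 to (n+1)/2
            out.append(j)
            for k in range(1, j - 1):          # for k := 1 to j-2
                out.append(k)
                out.append(j)

    for i in range(n - 1, (3 * n + 1) // 4 - 1, -1):   # i := n-1 downto (3n+1)/4
        sweep()
        for j in range((n + 3) // 2, i):               # j := (n+3)/2 to i-1
            out.append(j)
        for j in range(i, (n + 3) // 2 - 1, -1):       # j := i downto (n+3)/2
            out.append(j)
    sweep()
    for i in range((n + 3) // 2, n + 1):               # i := (n+3)/2 to n
        out.append(i)
    return out


def steps_fig17(n):
    """Number of steps, i.e. number of printed states minus one."""
    return len(trace_fig17(n)) - 1
```

Multiplying the bound by 16 gives the integer identity $16 \cdot steps = n^3 + 6n^2 - 3n - 4$, which avoids fractions when comparing.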


Figure 19 shows a domain for which an admissible, zero-initialized Q-learning algorithm can need at least $\frac{1}{2}en - \frac{1}{4}n^2 - \frac{1}{2}n + 2$ steps to reach the goal state (for even n ≥ 2 and e ≥ 2n - 2):

for i := n-1 downto (n+2)/2
    for j := 1 to (e-2n+4)/2
        print 2
        print 1
    for j := 2 to i-1
        print j
    for j := i downto 3
        print j
for i := 1 to (e-2n+4)/2
    print 2
    print 1
for i := 2 to n
    print i
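
The step count of this trace can be checked in the same way. The Python sketch below is illustrative: `trace_fig19_q` and `bounce` are hypothetical names, and it assumes that e is even so that (e - 2n + 4)/2 is an integer.

```python
def trace_fig19_q(n, e):
    """State sequence of the Q-learning trace for Figure 19 (even n, even e)."""
    if n % 2 != 0 or e % 2 != 0 or n < 2 or e < 2 * n - 2:
        raise ValueError("expects even n >= 2 and even e >= 2n - 2")
    repeats = (e - 2 * n + 4) // 2
    out = []

    def bounce():
        # Alternate between states 2 and 1, (e-2n+4)/2 times.
        for _ in range(repeats):
            out.append(2)
            out.append(1)

    for i in range(n - 1, (n + 2) // 2 - 1, -1):   # i := n-1 downto (n+2)/2
        bounce()
        for j in range(2, i):                      # j := 2 to i-1
            out.append(j)
        for j in range(i, 2, -1):                  # j := i downto 3
            out.append(j)
    bounce()
    for i in range(2, n + 1):                      # i := 2 to n
        out.append(i)
    return out


def steps_fig19_q(n, e):
    """Number of steps, i.e. number of printed states minus one."""
    return len(trace_fig19_q(n, e)) - 1
```

Multiplying the bound by 4 gives the integer identity $4 \cdot steps = 2en - n^2 - 2n + 8$.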


Figure 22 shows a domain for which an admissible, zero-initialized value-iteration algorithm can need at least $n^2 - n$ steps to reach the goal state (for n ≥ 1):

for i := 1 to n-1
    for j := 1 to i
        print i
    for j := i downto 1
        print j
print n

An off-line search algorithm that knows the topology of the state space can reach the goal state in one step:

print 1
print n

Figure 24 shows a domain for which an admissible, zero-initialized value-iteration algorithm can need at least $\frac{3}{16}n^2 - \frac{3}{4}$ steps to reach the goal state (for n ≥ 1 with n mod 4 = 2):

for i := n-3 downto n/2 step 2
    for j := 1 to i step 2
        print j
    for j := i+1 downto 2 step 2
        print j
for i := 1 to n-1 step 2
    print i

B.4  Reaching  a  Goal  State  -  A  More  Complicated  Case 

Consider an undiscounted, admissible, and zero-initialized Q-learning algorithm that operates in a reset state space of size n ≥ 1. As shown in Figure 8, the actions $a_1$ decrease the goal distance, whereas the actions $a_2$ lead the agent back to state 1.

Theorem 58 $U(s) \le \frac{s-n}{2}$ for all $s \in S$ after termination of an undiscounted, admissible, and zero-initialized Q-learning algorithm in a reset state space of size n ≥ 1 if ties are broken in favor of actions that lead to states with smaller numbers.

Proof by induction on n. Let t' be the step when the algorithm has terminated, i.e. $U^{t'}(s)$ are the final U-values.


• If n = 1, then the algorithm terminates immediately, since $s_{start} = 1 \in G$. All Q-values remain zero, and $U^{t'}(s) = \max_{a \in A(s)} Q^{t'}(s,a) = \max_{a \in A(s)} 0 = 0 = \frac{s-n}{2}$ for all $s \in S = \{1\}$.

• Assume that the theorem holds for an arbitrary n ≥ 1 and consider now a reset state space of size n + 1.

When the agent enters state n + 1, the algorithm terminates before changing any Q-value in that state. The values Q(n + 1, a) remain zero for all $a \in A(n+1)$, and $U^{t'}(n+1) = \max_{a \in A(n+1)} Q^{t'}(n+1,a) = \max_{a \in A(n+1)} 0 = 0 = \frac{(n+1)-(n+1)}{2}$.

State n + 1 is the only goal state. In order to reach it from $s_{start} = 1$, the agent has to visit state n at least once. Let t < t' be the earliest step at which $s^t = n$. Since the states s ≤ n form (together with their actions that do not leave this set of states) a reset state space of size n, it holds that $U^t(s) \le \frac{s-n}{2}$ for all $s \in \{1, 2, \ldots, n\}$ according to the assumption. Since $Q^t(n,a_1) = Q^t(n,a_2) = 0$, $\mathrm{succ}(n,a_2) = 1 < n + 1 = \mathrm{succ}(n,a_1)$, and ties are broken in favor of actions that lead to states with smaller numbers, the agent chooses $a^t = a_2$ for execution at step t and is in state $s^{t+1} = 1$ afterwards. In order to reach the goal state n + 1 from there, the agent has to execute action $a_1$ in every state $s \in \{1, 2, \ldots, n\}$ at least once. Let t(s) with t' > t(s) > t be a step at which $s^{t(s)} = s \in \{1, 2, \ldots, n\}$ and $a^{t(s)} = a_1$.

The following argument holds for all $s \in \{2, 3, \ldots, n-1\}$: $Q^{t(s)}(s,a_1) > Q^{t(s)}(s,a_2)$, since $a_1$ is chosen over $a_2$ at step t(s). (If $Q^{t(s)}(s,a_1) \le Q^{t(s)}(s,a_2)$, then $a_2$ would be chosen over $a_1$, since $\mathrm{succ}(s,a_2) = 1 < s + 1 = \mathrm{succ}(s,a_1)$ and ties are broken in favor of actions that lead to states with smaller numbers.) Since the Q-values are integers and, according to Theorem 22, never increase, $Q^{t'}(s,a_2) \le Q^{t(s)}(s,a_2) \le Q^{t(s)}(s,a_1) - 1 \le Q^t(s,a_1) - 1 \le \max(Q^t(s,a_1), Q^t(s,a_2)) - 1 = U^t(s) - 1 \le \frac{s-n}{2} - 1 < \frac{s-(n+1)}{2}$. Furthermore, $Q^{t'}(s,a_1) \le Q^{t(s)+1}(s,a_1) = -1 + U^{t(s)}(\mathrm{succ}(s,a_1)) \le -1 + U^t(s+1) \le -1 + \frac{s+1-n}{2} = \frac{s-(n+1)}{2}$. Then, $U^{t'}(s) = \max(Q^{t'}(s,a_1), Q^{t'}(s,a_2)) \le \frac{s-(n+1)}{2}$.

A similar argument holds for state 1: $U^{t'}(1) = \max_{a \in A(1)} Q^{t'}(1,a) = Q^{t'}(1,a_1) \le Q^{t(1)+1}(1,a_1) = -1 + U^{t(1)}(\mathrm{succ}(1,a_1)) \le -1 + U^t(2) \le -1 + \frac{2-n}{2} = \frac{-n}{2} = \frac{1-(n+1)}{2}$.

A similar argument also holds for state n: $Q^{t(n)}(n,a_2) < Q^{t(n)}(n,a_1) \le 0$. Since the Q-values are integers and, according to Theorem 22, never increase, $Q^{t'}(n,a_2) \le Q^{t(n)}(n,a_2) \le -1$ and $Q^{t'}(n,a_1) \le Q^{t(n)+1}(n,a_1) = -1 + U^{t(n)}(\mathrm{succ}(n,a_1)) \le -1 + 0 = -1$. Then, $U^{t'}(n) = \max(Q^{t'}(n,a_1), Q^{t'}(n,a_2)) \le \max(-1,-1) = -1 \le -\frac{1}{2} = \frac{n-(n+1)}{2}$.

To summarize, $U^{t'}(s) \le \frac{s-(n+1)}{2}$ for all $s \in S = \{1, 2, \ldots, n+1\}$.


Theorem 59 $Q(1,a_1) \le \frac{1-n}{2}$ after termination of an undiscounted, admissible, and zero-initialized Q-learning algorithm in a reset state space of size n ≥ 1 if ties are broken in favor of actions that lead to states with smaller numbers.


Proof: $Q(1,a_1) = \max_{a \in A(1)} Q(1,a) = U(1) \le \frac{1-n}{2}$ after termination, where the inequality holds by Theorem 58.


Theorem 60 $Q(1,a_1) \le \frac{2-n}{2}$ when an undiscounted, admissible, and zero-initialized Q-learning algorithm is in state 1 of a reset state space of size n ≥ 2 for the last time if ties are broken in favor of actions that lead to states with smaller numbers.

Proof: Let t be the last step at which $s^t = 1$, and t' be the earliest step at which $s^{t'} = n - 1$. The states s ≤ n - 1 form (together with their actions that do not leave this set of states) a reset state space of size n - 1. $Q^{t'}(1,a_1) \le \frac{1-(n-1)}{2} = \frac{2-n}{2}$ according to Theorem 59. Since $Q^{t'}(n-1,a_1) = Q^{t'}(n-1,a_2) = 0$, $\mathrm{succ}(n-1,a_2) = 1 < n = \mathrm{succ}(n-1,a_1)$, and ties are broken in favor of actions that lead to states with smaller numbers, the agent chooses $a^{t'} = a_2$ for execution at step t' and is in state $s^{t'+1} = 1$ afterwards. Thus, t' < t and, according to Theorem 22, $Q^t(1,a_1) \le Q^{t'}(1,a_1) \le \frac{2-n}{2}$.


Theorem 61 Figure 12 shows a domain for which an admissible, zero-initialized Q-learning algorithm needs at least $\frac{(n+1)(n-1)(n-3)}{16} = \frac{1}{16}n^3 - \frac{3}{16}n^2 - \frac{1}{16}n + \frac{3}{16}$ steps to reach the goal state and terminate (for odd n ≥ 3) if ties are broken in favor of actions that lead to states with smaller numbers.
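
The polynomial form of this bound is the expansion of the product $(n+1)(n-1)(n-3)/16$ that appears at the end of the proof. As a worked check (a sketch of the arithmetic only):

```latex
\begin{align*}
\frac{(n+1)(n-1)(n-3)}{16}
  &= \frac{(n^2-1)(n-3)}{16} \\
  &= \frac{n^3 - 3n^2 - n + 3}{16} \\
  &= \tfrac{1}{16}n^3 - \tfrac{3}{16}n^2 - \tfrac{1}{16}n + \tfrac{3}{16}.
\end{align*}
```

For n = 3 the bound is 0, and for n = 5 it is $6 \cdot 4 \cdot 2 / 16 = 3$.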


Proof: The states $s \in \{\frac{n+1}{2}, \frac{n+3}{2}, \ldots, n\}$ form (together with their actions that do not leave this set of states) a reset state space of size $\frac{n+1}{2}$. $s_{start} = \frac{n+1}{2}$ is also the start state of the reset state space. Note that state $\frac{n+1}{2}$ separates the states that belong to the reset state space from the other states. Consider the trace that an admissible, zero-initialized Q-learning algorithm traverses in this reset state space if ties are broken in favor of actions that lead to states with smaller numbers. This state sequence is a subsequence of the state sequence that an admissible, zero-initialized Q-learning algorithm traverses in the whole domain (as shown in Figure 12) if ties are broken in favor of actions that lead to states with smaller numbers.

Let t' be the step when the algorithm has terminated in the domain from Figure 12 of size n (for odd n ≥ 3) if ties are broken in favor of actions that lead to states with smaller numbers, i.e. $Q^{t'}(s,a)$ are the final Q-values. We assume (without loss of generality) that the Q-learning algorithm is undiscounted. Then, according to Theorem 59, $Q^{t'}(\frac{n+1}{2},a_1) \le \frac{1-\frac{n+1}{2}}{2} = \frac{1-n}{4}$, where $a_1 \in A(\frac{n+1}{2})$ is the action in state $\frac{n+1}{2}$ that decreases the goal distance. Similarly, let t'' < t' be the last step at which $s^{t''} = \frac{n+1}{2}$ and $a^{t''} = a_1$. Then, according to Theorem 60, $Q^{t''}(\frac{n+1}{2},a_1) \le \frac{3-n}{4}$.

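Now consider an arbitrary $a \in A(\frac{n+1}{2}) \setminus \{a_1\}$. $Q^{t''}(\frac{n+1}{2},a) < Q^{t''}(\frac{n+1}{2},a_1)$. (If $Q^{t''}(\frac{n+1}{2},a) \ge Q^{t''}(\frac{n+1}{2},a_1)$, then a would be preferred over $a_1$, since $\mathrm{succ}(\frac{n+1}{2},a) < \frac{n+1}{2} < \frac{n+3}{2} = \mathrm{succ}(\frac{n+1}{2},a_1)$ and ties are broken in favor of actions that lead to states with smaller numbers.) Since the Q-values are integers, $Q^{t''}(\frac{n+1}{2},a) \le Q^{t''}(\frac{n+1}{2},a_1) - 1 \le \frac{3-n}{4} - 1 < 0$. Thus, there is a step when a is executed in state $\frac{n+1}{2}$, since initially $Q^0(\frac{n+1}{2},a) = 0$. Note that every path from a state smaller than $\frac{n+1}{2}$ to state n contains an execution of $a_1$ in state $\frac{n+1}{2}$, since $a_1$ is the last action executed in $\frac{n+1}{2}$. Let t(a) < t'' be the last step when a is executed in $\frac{n+1}{2}$. For all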

a'  €  /4(succ(2p,  a))  it  holds  that  Ql  (s«cc(Iip■,a),a,)  <  Ql (al(succ(2p-,a),a')  < 

maxa''e^ucc(^,a))Qi(a)(^cc(:ip-1a)1a")  =  UtM{succ{^,a))  =  1  +  Q‘(«>+i(a±I,a)  = 

.  ,  „  Theorem  22  _ 

1  +  (^‘(^pja)  <  1  4-  Ql  (^p^)  <  Q*  (^,01)  <  -1.  Note  that  for  every 
s  €  {1,2, ...,ap}  both  IA(s)l  =  ^p  and  there  exists  an  a  €  ^(^p)  \  {®i}  such 
that  sttc^^p,*!)  =  s.  Because  the  state  space  has  no  identity  actions,  Q'  (s.a)  <  0 

Figure 35: Run-time of an admissible, zero-initialized Q-learning algorithm for reaching the goal state in the domain shown in Figure 12 (for odd n ∈ [3, 99]) if ties are broken in favor of actions that lead to states with smaller numbers


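for all $s \in S$ and $a \in A(s)$, $U^{t'}(n) = 0$ (since state n is a goal), and $|A(s)| \ge \frac{n-1}{2}$ for every state $s \le \frac{n+1}{2}$, it holds according to Theorem 26 that $t' = U^{t'}(n) - \sum_{s \in S} \sum_{a \in A(s)} Q^{t'}(s,a) \ge -\sum_{s=1}^{(n+1)/2} \sum_{a \in A(s)} Q^{t'}(s,a) \ge -\sum_{s=1}^{(n+1)/2} \sum_{a \in A(s)} \frac{3-n}{4} \ge \frac{n+1}{2} \cdot \frac{n-1}{2} \cdot \frac{n-3}{4} = \frac{(n+1)(n-1)(n-3)}{16} = \frac{1}{16}n^3 - \frac{3}{16}n^2 - \frac{1}{16}n + \frac{3}{16}$.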


Theorem  2  and  Theorem  61  together  imply  that  the  complexity  of  an  admissible,  zero- 
initialized  Q-learning  algorithm  in  the  domain  shown  in  Figure  12  is  tight  at  0(n3).  The 
actual  number  of  steps  needed  by  an  admissible,  zero-initialized  Q-learning  algorithm 
to  reach  the  goal  state  and  terminate  in  the  domain  shown  in  Figure  12  if  ties  are 
broken  in  favor  of  actions  that  lead  to  states  with  smaller  numbers  is  not  a  smooth 
function of $n$. Since we ignored the Q-values of the states $s \in \{\frac{n+1}{2}, \ldots, n\}$ in
the proof of Theorem 61, the lower bound $\frac{(n+1)(n-1)(n-3)}{16}$ is not tight. Figure 35 shows
both the actual number of steps needed and the lower bound for odd $n \in [3, 99]$.


Theorem 62 Figure 12 shows a domain for which the Qmap-learning algorithm needs
at most $\frac{3}{8}n^2 + \frac{3}{2}n - \frac{23}{8}$ steps to reach the goal state and terminate (for odd $n \ge 3$).
Proof: Note that state $\frac{n+1}{2}$ separates the states smaller than $\frac{n+1}{2}$ from the states larger
than $\frac{n+1}{2}$. In the following, we assume that an arbitrary trace is given that the Qmap-
learning algorithm has traversed when reaching the goal state from $s_{start} = \frac{n+1}{2}$. We
analyze the trace in both subsets of states separately.

• Consider the states larger than or equal to $\frac{n+1}{2}$. We call the actions that decrease
the goal distance action $a_1$, and the actions that lead the agent back to state $\frac{n+1}{2}$
action $a_2$. Consider any consecutive subsequence of the trace that starts and ends
in state $\frac{n+1}{2}$ and otherwise contains only states larger than $\frac{n+1}{2}$. This subsequence
must contain an action $a_2$ that is unexplored. (Since the start state and end state
of the sequence are identical, it must contain at least one unexplored action. If
it contains an unexplored action $a_2$, the statement is true. If it contains an
unexplored action $a_1 \in A(s)$, then all states $s'$ with $s' > s$ are still unexplored
and thus no action has yet been executed in them. In particular, the action $a_2$
that the agent must execute in a state larger than $s$ to reach state $\frac{n+1}{2}$ again is still
unexplored.) If $a_2 \in A(s)$, then the length of the subsequence is $1 + s - \frac{n+1}{2}$. There
can be at most one such subsequence for every state $s \in \{\frac{n+3}{2}, \frac{n+5}{2}, \ldots, n-1\}$,
since there is only one action $a_2$ in every such state. In addition, there is one
subsequence of length $n - \frac{n+1}{2} = \frac{n-1}{2}$ that leads from state $\frac{n+1}{2}$ to the goal
state $n$. Since these are all of the subsequences that contain states larger than
$\frac{n+1}{2}$, the total of action executions in this part of the state space is at most
$\sum_{s=\frac{n+3}{2}}^{n-1} \left(1 + s - \frac{n+1}{2}\right) + \frac{n-1}{2} = \frac{1}{8}n^2 + \frac{1}{2}n - \frac{13}{8}$.

• Now consider the states smaller than or equal to $\frac{n+1}{2}$. Delete first all states
from the original trace that are followed by a state larger than $\frac{n+1}{2}$, and delete
afterwards all states larger than $\frac{n+1}{2}$. (We have accounted for all action executions
in these states in the paragraph above.) Note that zero-initialized Q-learning
always executes an unexplored action in the current state if one is available, since
the action execution step always executes the action with the largest Q-value and
unexplored actions have a Q-value of 0 whereas explored actions have a Q-value
smaller than 0. Consequently, Qmap-learning also executes unexplored actions
if they are available. Since the states smaller than or equal to $\frac{n+1}{2}$ (together
with the actions that do not leave this set of states) form a 1-step invertible
state space, the shortened state sequence contains only executions of unexplored
actions until all actions in state $\frac{n+1}{2}$ are explored. Then, the following behavior
is repeated until all actions are explored: The agent executes an explored action
that leads to a state $s$ in which at least one action is still unexplored. Then, the
agent executes only unexplored actions, until all actions in state $s$ are explored,
optionally followed by an explored action that leads to state $\frac{n+1}{2}$. To summarize,
every one of the $\frac{n^2-1}{4}$ actions in this part of the state space is executed exactly
once plus at most two actions for every state smaller than $\frac{n+1}{2}$, leading to a total
number of action executions of $\frac{n^2-1}{4} + 2 \cdot \frac{n-1}{2} = \frac{1}{4}n^2 + n - \frac{5}{4}$.

In conclusion, the total number of action executions is bounded from above by
$(\frac{1}{8}n^2 + \frac{1}{2}n - \frac{13}{8}) + (\frac{1}{4}n^2 + n - \frac{5}{4}) = \frac{3}{8}n^2 + \frac{3}{2}n - \frac{23}{8}$ no matter how ties among
actions that have the same Q-value are broken.
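
The arithmetic behind the two partial bounds can be checked mechanically. The following is a minimal sketch, not part of the original proof; the function names `upper_part` and `lower_part` are hypothetical, and the sketch assumes that the first part is the sum of the subsequence lengths $1 + s - \frac{n+1}{2}$ plus the final run of length $\frac{n-1}{2}$, and that the second part counts $\frac{n+1}{2} \cdot \frac{n-1}{2}$ actions plus two extra executions for each of the $\frac{n-1}{2}$ smaller states.

```python
from fractions import Fraction


def upper_part(n):
    """Bound for the states >= (n+1)/2 (assumed reading of the first bullet)."""
    m = (n + 1) // 2
    # One subsequence of length 1 + s - m for every s in {m+1, ..., n-1} ...
    total = sum(1 + s - m for s in range(m + 1, n))
    # ... plus the final run of length n - m that reaches the goal state n.
    return total + (n - m)


def lower_part(n):
    """Bound for the states <= (n+1)/2 (assumed reading of the second bullet)."""
    m = (n + 1) // 2
    # m * (m - 1) actions executed once, plus two extra executions per smaller state.
    return m * (m - 1) + 2 * (m - 1)


for n in range(3, 100, 2):  # odd n only
    assert upper_part(n) == Fraction(n * n, 8) + Fraction(n, 2) - Fraction(13, 8)
    assert lower_part(n) == Fraction(n * n, 4) + n - Fraction(5, 4)
    total = Fraction(3 * n * n, 8) + Fraction(3 * n, 2) - Fraction(23, 8)
    assert upper_part(n) + lower_part(n) == total
```

Exact rational arithmetic is used so that the closed forms are compared without rounding.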


Note that every uninformed on-line search algorithm can need at least $\frac{1}{2}n^2 + \frac{1}{2}n - 2$
steps to reach the goal state in a reset state space (shown in Figure 8) of size $n$ (for
$n \ge 2$):


for i := 2 to n
    for j := 1 to i
        print j
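
As a quick check of the step count, here is a minimal Python sketch of the loop above. It assumes that each printed number is a visited state and that every pair of consecutive states corresponds to one action execution; the function names `worst_case_trace` and `steps` are hypothetical.

```python
def worst_case_trace(n):
    """State sequence produced by the nested loop for a reset state space of size n."""
    trace = []
    for i in range(2, n + 1):
        for j in range(1, i + 1):
            trace.append(j)
    return trace


def steps(n):
    """Number of action executions, i.e. transitions between consecutive states."""
    return len(worst_case_trace(n)) - 1


for n in range(2, 50):
    # Compare against 1/2 n^2 + 1/2 n - 2, multiplied by 2 to stay in integers.
    assert 2 * steps(n) == n * n + n - 4
```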

Now consider again the domain shown in Figure 12 and assume it has size $n$. As
remarked earlier, the states $s \ge \frac{n+1}{2}$ form a reset state space of size $\frac{n+1}{2}$. Thus, every
uninformed on-line search algorithm can need at least $\frac{1}{2}(\frac{n+1}{2})^2 + \frac{1}{2}(\frac{n+1}{2}) - 2 =
\frac{1}{8}n^2 + \frac{1}{2}n - \frac{13}{8}$ steps (for odd $n \ge 3$). This fact and Theorem 62 together imply
that the complexity of the Qmap-learning algorithm in the domain shown in Figure 12
is tight at $O(n^2)$.


B.5  Reaching  a  Goal  State  with  Random  Walks 

Equations 4 allow one to determine the average number of steps required by a random
walk to reach a goal state in a given domain, as outlined in the main text. Consider,
for example, the reset state space shown in Figure 8. The corresponding equations are

\begin{align*}
x_1 &= 1 + x_2 \\
x_s &= 1 + 0.5 x_1 + 0.5 x_{s+1} \quad \text{for all } s \in \{2, 3, \ldots, n-1\} \\
x_n &= 0
\end{align*}


They can be solved for $x_{s_{start}}$ as follows

\begin{align*}
x_{s_{start}} &= x_1 \\
&= 1 + x_2 \\
&= 1 + 1 + 0.5 x_1 + 0.5 x_3 \\
&= 1 + (1 + 0.5) + 0.5 x_1 (1 + 0.5) + 0.5^2 x_4 \\
&= 1 + (1 + 0.5 + 0.5^2) + 0.5 x_1 (1 + 0.5 + 0.5^2) + 0.5^3 x_5 \\
&\;\;\vdots \\
&= 1 + \sum_{i=0}^{s-3} 0.5^i + 0.5 x_1 \sum_{i=0}^{s-3} 0.5^i + 0.5^{s-2} x_s \quad \text{for all } s \in \{2, 3, \ldots, n\} \\
&= 1 + \sum_{i=0}^{n-3} 0.5^i + 0.5 x_1 \sum_{i=0}^{n-3} 0.5^i + 0.5^{n-2} x_n \\
&= 1 + \frac{1 - 0.5^{n-2}}{1 - 0.5} + 0.5 x_1 \frac{1 - 0.5^{n-2}}{1 - 0.5} + 0.5^{n-2} \times 0 \\
&= 3 \times 2^{n-2} - 2
\end{align*}

The equations for the domains in Figures 9, 17, and 18 can be derived in a similar way.
In order to solve them, one usually uses generating functions, see for example [7] for
a mathematical derivation and [37] for an application in the context of reinforcement
learning. Since these derivations are long and cumbersome, we do not state them here,
but make use of the fact that the solution is unique and show how to check a given
solution by plugging it into the equations. Consider for example the one-dimensional
gridworld shown in Figure 18. The corresponding equations are

\begin{align*}
x_1 &= 1 + x_2 \tag{7} \\
x_s &= 1 + 0.5 x_{s-1} + 0.5 x_{s+1} \quad \text{for all } s \in \{2, 3, \ldots, n-1\} \tag{8} \\
x_n &= 0 \tag{9}
\end{align*}


Consider the solution for $x_{s_{start}}$ stated in Figure 18

\[ x_{s_{start}} = x_1 = n^2 - 2n + 1 \]


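The solutions for the other variables $x_s$ can be derived from this solution as follows

\begin{align*}
x_1 = 1 + x_2 = n^2 - 2n + 1 \quad &\Longrightarrow \quad x_2 = n^2 - 2n \\
x_2 = 1 + 0.5 x_1 + 0.5 x_3 = n^2 - 2n \quad &\Longrightarrow \quad x_3 = n^2 - 2n - 3 \\
x_3 = 1 + 0.5 x_2 + 0.5 x_4 = n^2 - 2n - 3 \quad &\Longrightarrow \quad x_4 = n^2 - 2n - 8 \\
&\;\;\vdots \\
x_{s-1} = 1 + 0.5 x_{s-2} + 0.5 x_s = n^2 - 2n - (s-1)^2 + 2(s-1) \quad &\Longrightarrow \quad x_s = n^2 - 2n - s^2 + 2s
\end{align*}

for $s \in \{3, 4, \ldots, n\}$. Thus,

\[ x_s = n^2 - 2n - s^2 + 2s \quad \text{for all } s \in S = \{1, 2, \ldots, n\} \]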


The $x_s$ for $s \in S$ were constructed to satisfy Equation 7 and Equations 8. In order
to prove that the $x_s$ are indeed a solution of the set of equations, we are left to verify
Equation 9, and indeed

\[ x_n = n^2 - 2n - n^2 + 2n = 0 \]
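
Plugging a candidate solution into the equations is easy to automate. The following minimal sketch (the function name `x` is chosen here for illustration) checks Equations 7, 8, and 9 for the closed form above over a range of gridworld sizes.

```python
def x(s, n):
    """Candidate solution x_s = n^2 - 2n - s^2 + 2s for the one-dimensional gridworld."""
    return n * n - 2 * n - s * s + 2 * s


for n in range(2, 60):
    # Equation 7: x_1 = 1 + x_2
    assert x(1, n) == 1 + x(2, n)
    # Equations 8, multiplied by 2 to stay in integers: 2 x_s = 2 + x_{s-1} + x_{s+1}
    for s in range(2, n):
        assert 2 * x(s, n) == 2 + x(s - 1, n) + x(s + 1, n)
    # Equation 9: x_n = 0
    assert x(n, n) == 0
    # Value for the start state: n^2 - 2n + 1 = (n - 1)^2
    assert x(1, n) == (n - 1) ** 2
```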


C  Value-Iteration  and  Q-Learning 

In  the  following,  we  use  the  transformation  rule,  terminology,  and  definitions  of  symbols 
introduced  in  Chapter  6.1  to  show  how  results  about  value-iteration  can  be  transferred 
to  Q-learning  and  vice  versa. 
The time superscripts used in this chapter refer to the values of the variables immediately
before the action execution step of the Q-learning or value-iteration algorithm,
i.e. line 4, of step t.


Theorem 63 For all $s \in S \setminus G$ and $a \in A(s)$, it holds that $\widetilde{gd}(\tilde{s}_{s,a}) = gd(succ(s,a)) + 1$.

Proof: Assume that we execute action $a_y \in A(s_y)$ in state $s_y \in S$ at time $1 \le y \le
x := gd(succ(s,a))$ while following a shortest path from state $s_1 := succ(s,a)$ to the
closest goal state $s_{x+1} := succ(s_x, a_x) \in G$. Let $a_{x+1} \in A(s_{x+1})$ be an arbitrary action
in $s_{x+1}$. Then, the state sequence $\tilde{s}_{s,a}, \tilde{s}_{s_1,a_1}, \tilde{s}_{s_2,a_2}, \ldots, \tilde{s}_{s_x,a_x}, \tilde{s}_{s_{x+1},a_{x+1}}$ of states in $\tilde{S}$
describes a path from state $\tilde{s}_{s,a}$ to the goal state $\tilde{s}_{s_{x+1},a_{x+1}} \in \tilde{G}$ of length $x + 1$. Thus,
$\widetilde{gd}(\tilde{s}_{s,a}) \le x + 1 = gd(succ(s,a)) + 1$.

Conversely, assume that we execute action $\tilde{a}_y \in \tilde{A}(\tilde{s}_y)$ in state $\tilde{s}_y = \tilde{s}_{s_y,a_y} \in \tilde{S}$ at time
$1 \le y \le x := \widetilde{gd}(\tilde{s}_{s,a})$ while following a shortest path from state $\tilde{s}_1 := \tilde{s}_{s,a}$ to the closest
goal state $\tilde{s}_{x+1} := \widetilde{succ}(\tilde{s}_x, \tilde{a}_x) \in \tilde{G}$. Then, the state sequence $s_2, s_3, \ldots, s_{x+1}$ of states
in $S$ describes a path from $s_2 = succ(s,a)$ to the goal state $s_{x+1} \in G$ of length $x - 1$,
which cannot be shorter than $gd(succ(s,a))$. Thus, $\widetilde{gd}(\tilde{s}_{s,a}) = x \ge gd(succ(s,a)) + 1$.

Put together, it follows that $\widetilde{gd}(\tilde{s}_{s,a}) = gd(succ(s,a)) + 1$.


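Theorem 64 $d \le \tilde{d} \le d + 1$.

Proof: We prove $d \le \tilde{d}$ and $\tilde{d} \le d + 1$ separately.

Consider two arbitrary states $\tilde{s}_{s,a}, \tilde{s}_{s',a'} \in \tilde{S}$. We show that $\tilde{d}(\tilde{s}_{s,a}, \tilde{s}_{s',a'}) \le d + 1$.
Assume that we execute action $a_y \in A(s_y)$ in state $s_y \in S$ at time $1 \le y \le x :=
d(succ(s,a), s') \le d$ while following a shortest path from state $s_1 := succ(s,a)$ to state
$s_{x+1} := succ(s_x, a_x) := s'$. Then, the state sequence $\tilde{s}_{s,a}, \tilde{s}_{s_1,a_1}, \ldots, \tilde{s}_{s_x,a_x}, \tilde{s}_{s',a'}$
of states in $\tilde{S}$ describes a path from state $\tilde{s}_{s,a}$ to state $\tilde{s}_{s',a'}$ of length $x + 1$. Thus,
$\tilde{d}(\tilde{s}_{s,a}, \tilde{s}_{s',a'}) \le x + 1 \le d + 1$ and $\tilde{d} = \max_{\tilde{s},\tilde{s}' \in \tilde{S}} \tilde{d}(\tilde{s}, \tilde{s}') \le \max_{\tilde{s},\tilde{s}' \in \tilde{S}} (d + 1) = d + 1$.

Next, consider two arbitrary states $s, s' \in S$ together with two arbitrary actions $a \in
A(s)$ and $a' \in A(s')$. We show that $d(s,s') \le \tilde{d}$. Assume that we execute action
$\tilde{a}_y \in \tilde{A}(\tilde{s}_y)$ in state $\tilde{s}_y = \tilde{s}_{s_y,a_y} \in \tilde{S}$ at time $1 \le y \le x := \tilde{d}(\tilde{s}_{s,a}, \tilde{s}_{s',a'}) \le \tilde{d}$ while
following a shortest path from state $\tilde{s}_1 := \tilde{s}_{s,a}$ to state $\tilde{s}_{x+1} := \widetilde{succ}(\tilde{s}_x, \tilde{a}_x) := \tilde{s}_{s',a'}$.
Then, the state sequence $s_1, s_2, \ldots, s_{x+1}$ of states in $S$ describes a path from state
$s_1 = s$ to state $s_{x+1} = s'$ of length $x$, which cannot be shorter than $d(s,s')$. Thus
$\tilde{d} \ge x \ge d(s,s')$ and $d = \max_{s,s' \in S} d(s,s') \le \max_{s,s' \in S} \tilde{d} = \tilde{d}$.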


Figures 36 and 37 show that indeed both $\tilde{d} = d$ and $\tilde{d} = d + 1$ are possible. Since
$\tilde{d} \le d + 1$, the transformed domain is guaranteed to be strongly connected.
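
Theorems 63 and 64 can be checked on a small example with breadth-first search. The following is a minimal sketch, assuming (as the proofs above suggest) that the transformed domain has one state per state-action pair of the original domain and that executing an action in the transformed state for $(s,a)$ leads to the transformed state for $(succ(s,a), a')$. The helper names and the four-state reset domain are hypothetical.

```python
from collections import deque


def transform(domain):
    """State-action pairs become states; (s, a) under a2 leads to (succ(s, a), a2)."""
    return {(s, a): {a2: (t, a2) for a2 in domain[t]}
            for s, actions in domain.items() for a, t in actions.items()}


def distances(domain, source):
    """Breadth-first search: action executions needed from source to each reachable state."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        s = queue.popleft()
        for t in domain[s].values():
            if t not in dist:
                dist[t] = dist[s] + 1
                queue.append(t)
    return dist


def goal_distance(domain, goals, s):
    dist = distances(domain, s)
    return min(dist[g] for g in goals if g in dist)


def diameter(domain):
    """Largest shortest-path length between any two states (strongly connected domain)."""
    return max(max(distances(domain, s).values()) for s in domain)


# Hypothetical reset state space with four states; state 4 is the goal.
domain = {1: {"f": 2}, 2: {"f": 3, "r": 1}, 3: {"f": 4, "r": 1}, 4: {"r": 1}}
goals = {4}
tdomain = transform(domain)
tgoals = {(s, a) for (s, a) in tdomain if s in goals}

# Theorem 63: goal distance of the transformed state is gd(succ(s, a)) + 1.
for (s, a) in tdomain:
    if s not in goals:
        expected = goal_distance(domain, goals, domain[s][a]) + 1
        assert goal_distance(tdomain, tgoals, (s, a)) == expected

# Theorem 64: the diameter grows by at most one.
assert diameter(domain) <= diameter(tdomain) <= diameter(domain) + 1
```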
[Figure 36 panels: (a) Original domain, (b) Transformed domain; each panel marks a start state and a goal state.]

Figure 36: A domain with $d = n - 1$ and its transformation with $\tilde{d} = d$


Theorem 65 Undiscounted, admissible value-iteration with action-penalty representation
behaves in the transformed domain identically to undiscounted, admissible Q-learning
with action-penalty representation in the original domain if the original domain
has no identity actions, initially $Q(s,a) = \tilde{U}(\tilde{s}_{s,a})$ for all $s \in S$ and $a \in A(s)$,
and ties are broken in the same way. Also, $\tilde{U}^t(\tilde{s}_{s,a}) = Q^t(s,a)$ for all $s \in S$, $a \in A(s)$,
and steps $t$ (until termination).

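Proof: We prove by induction on $t$ that $\tilde{U}^t(\tilde{s}_{s,a}) = Q^t(s,a)$ for all $s \in S$, $a \in A(s)$, and
steps $t$ (until termination). At the same time, we prove that $\tilde{s}^t = \tilde{s}_{s^t,a^t}$ for all steps $t$
(until termination).

• Initially, $\tilde{s}^1 = \tilde{s}_{start} = \tilde{s}_{s_{start},\, \arg\max_{a \in A(s_{start})} Q(s_{start},a)} = \tilde{s}_{s^1,a^1}$ if ties are broken
in the same way. Also, $\tilde{U}^1(\tilde{s}_{s,a}) = Q^1(s,a)$ for all $s \in S$ and $a \in A(s)$ per
definition.

• Assume that the two statements hold for step $t$.

First, we prove that $\tilde{a}^t = a^{t+1}$, i.e. both algorithms select the same actions.

\begin{align*}
\tilde{a}^t &= \arg\max_{\tilde{a} \in \tilde{A}(\tilde{s}^t)} \tilde{U}^t(\widetilde{succ}(\tilde{s}^t, \tilde{a})) \\
&= \arg\max_{\tilde{a} \in \tilde{A}(\tilde{s}_{s^t,a^t})} \tilde{U}^t(\widetilde{succ}(\tilde{s}_{s^t,a^t}, \tilde{a})) \\
&= \arg\max_{a' \in A(succ(s^t,a^t))} \tilde{U}^t(\tilde{s}_{succ(s^t,a^t),a'}) \\
&= \arg\max_{a' \in A(succ(s^t,a^t))} Q^t(succ(s^t,a^t), a') \\
&= \arg\max_{a' \in A(s^{t+1})} Q^t(s^{t+1}, a') \\
&= \arg\max_{a' \in A(s^{t+1})} Q^{t+1}(s^{t+1}, a') \\
&= a^{t+1}
\end{align*}

if ties are broken in the same way. Note that $Q(s^t,a^t)$ is the only Q-value of the
original domain that changes from step $t$ to step $t + 1$. Since the domain has no
identity actions, it holds that $s^t \ne s^{t+1}$ and therefore $Q^t(s^{t+1},a') = Q^{t+1}(s^{t+1},a')$
for all $a' \in A(s^{t+1})$.

Next, we prove that $\tilde{U}^{t+1}(\tilde{s}^t) = Q^{t+1}(s^t,a^t)$.

\begin{align*}
\tilde{U}^{t+1}(\tilde{s}^t) &= \max_{\tilde{a} \in \tilde{A}(\tilde{s}^t)} \left(-1 + \tilde{U}^t(\widetilde{succ}(\tilde{s}^t, \tilde{a}))\right) \\
&= \max_{\tilde{a} \in \tilde{A}(\tilde{s}_{s^t,a^t})} \left(-1 + \tilde{U}^t(\widetilde{succ}(\tilde{s}_{s^t,a^t}, \tilde{a}))\right) \\
&= \max_{a' \in A(succ(s^t,a^t))} \left(-1 + \tilde{U}^t(\tilde{s}_{succ(s^t,a^t),a'})\right) \\
&= -1 + \max_{a' \in A(succ(s^t,a^t))} Q^t(succ(s^t,a^t), a') \\
&= -1 + U^t(succ(s^t,a^t)) \\
&= Q^{t+1}(s^t,a^t)
\end{align*}

Since the only $\tilde{U}$-value that changes from step $t$ to step $t + 1$ in the transformed
domain is $\tilde{U}(\tilde{s}^t) = \tilde{U}(\tilde{s}_{s^t,a^t})$, the only Q-value that changes during the same time
span in the original domain is $Q(s^t,a^t)$, and $\tilde{U}^t(\tilde{s}_{s,a}) = Q^t(s,a)$ for all $s \in S$ and
$a \in A(s)$, it follows that $\tilde{U}^{t+1}(\tilde{s}_{s,a}) = Q^{t+1}(s,a)$ for all $s \in S$ and $a \in A(s)$ as
well.

Finally, we prove that $\tilde{s}^{t+1} = \tilde{s}_{s^{t+1},a^{t+1}}$. $\tilde{s}^{t+1} = \widetilde{succ}(\tilde{s}^t, \tilde{a}^t) = \widetilde{succ}(\tilde{s}_{s^t,a^t}, \tilde{a}^t) =
\tilde{s}_{succ(s^t,a^t),a^{t+1}} = \tilde{s}_{s^{t+1},a^{t+1}}$.

Q-learning terminates at the earliest step $t$ at which $s^t \in G$. Thus, $s^{t'} \notin G$ for $t' < t$.
Since $\tilde{s}^{t'} = \tilde{s}_{s^{t'},a^{t'}}$ for all steps $t'$, it follows that $\tilde{s}^{t'} \notin \tilde{G}$ for $t' < t$ and $\tilde{s}^t \in \tilde{G}$. Since the
value-iteration algorithm terminates at the earliest time at which its current state is a
goal state, it terminates at step $t$.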
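
The correspondence stated in Theorem 65 can be observed on a small example. The following is a minimal sketch, not the report's own implementation: it assumes zero-initialized values, an action penalty of $-1$, the state-action-pair transformation described above, and a shared tie-breaking rule (the helper `greedy` prefers the action name that sorts last). The four-state reset domain and all function names are hypothetical.

```python
def transform(domain):
    """State-action pairs become states; (s, a) under a2 leads to (succ(s, a), a2)."""
    return {(s, a): {a2: (t, a2) for a2 in domain[t]}
            for s, actions in domain.items() for a, t in actions.items()}


def greedy(actions, value):
    """Action with the largest value; ties go to the name that sorts last."""
    return max(sorted(actions, reverse=True), key=value)


def q_learning(domain, goals, start):
    Q = {(s, a): 0 for s in domain for a in domain[s]}  # zero-initialized Q-values
    s, trace = start, []
    while s not in goals:
        a = greedy(domain[s], lambda a: Q[s, a])
        t = domain[s][a]
        Q[s, a] = -1 + max(Q[t, b] for b in domain[t])  # -1 + U(succ(s, a))
        trace.append((s, a))
        s = t
    return trace, Q


def value_iteration(tdomain, tgoals, tstart):
    U = {s: 0 for s in tdomain}                         # zero-initialized U-values
    s, trace = tstart, []
    while s not in tgoals:
        a = greedy(tdomain[s], lambda a: U[tdomain[s][a]])
        U[s] = max(-1 + U[t] for t in tdomain[s].values())
        trace.append(s)
        s = tdomain[s][a]
    return trace, U


# Hypothetical reset state space with four states; state 4 is the goal.
domain = {1: {"f": 2}, 2: {"f": 3, "r": 1}, 3: {"f": 4, "r": 1}, 4: {"r": 1}}
goals, start = {4}, 1

q_trace, Q = q_learning(domain, goals, start)
tdomain = transform(domain)
tgoals = {(s, a) for (s, a) in tdomain if s in goals}
# The transformed start state pairs the start state with Q-learning's first action.
tstart = (start, greedy(domain[start], lambda a: 0))
v_trace, U = value_iteration(tdomain, tgoals, tstart)

assert v_trace == q_trace  # same sequence of (state, action) pairs
assert U == Q              # same values after termination
```

In this example the tie-breaking rule favors the reset action, so both algorithms need the worst-case number of steps for a reset state space of size 4.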


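Theorem 66 The $\tilde{U}$-values of the transformed domain are consistent iff the Q-values
of the original domain are consistent.

Proof: Note that $\tilde{s} = \tilde{s}_{s,a} \in \tilde{G}$ iff $s \in G$, for all $s \in S$ and $a \in A(s)$. Also $\tilde{U}(\tilde{s}_{s,a}) =
Q(s,a)$ for all $s \in S$ and $a \in A(s)$ according to Theorem 65. The $\tilde{U}$-values are
consistent iff, for all $\tilde{s}_{s,a} \in \tilde{G}$, $\tilde{U}(\tilde{s}_{s,a}) = 0$, and, for all $\tilde{s}_{s,a} \in \tilde{S} \setminus \tilde{G}$,
$\max_{\tilde{a} \in \tilde{A}(\tilde{s}_{s,a})} (-1 + \tilde{U}(\widetilde{succ}(\tilde{s}_{s,a}, \tilde{a}))) \le \tilde{U}(\tilde{s}_{s,a}) \le 0$.

$\tilde{U}(\tilde{s}_{s,a}) = 0$ for all $\tilde{s}_{s,a} \in \tilde{G}$ is equivalent to $Q(s,a) = \tilde{U}(\tilde{s}_{s,a}) = 0$ for all $s \in G$.

$\max_{\tilde{a} \in \tilde{A}(\tilde{s}_{s,a})} (-1 + \tilde{U}(\widetilde{succ}(\tilde{s}_{s,a}, \tilde{a}))) \le \tilde{U}(\tilde{s}_{s,a}) \le 0$ for all $\tilde{s}_{s,a} \in \tilde{S} \setminus \tilde{G}$ is equivalent to
the following inequalities

\begin{align*}
\max_{\tilde{a} \in \tilde{A}(\tilde{s}_{s,a})} \left(-1 + \tilde{U}(\widetilde{succ}(\tilde{s}_{s,a}, \tilde{a}))\right) &\le \tilde{U}(\tilde{s}_{s,a}) \le 0 \\
-1 + \max_{a' \in A(succ(s,a))} Q(succ(s,a), a') &\le Q(s,a) \le 0 \\
-1 + U(succ(s,a)) &\le Q(s,a) \le 0
\end{align*}

for all $s \in S \setminus G$.

Put together, it follows that the $\tilde{U}$-values of the transformed domain are consistent iff
the Q-values of the original domain are consistent.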


Theorem 67 The $\tilde{U}$-values of the transformed domain are admissible iff the Q-values
of the original domain are admissible.

Proof: Note that $\tilde{s} = \tilde{s}_{s,a} \in \tilde{G}$ iff $s \in G$, for all $s \in S$ and $a \in A(s)$. Also $\tilde{U}(\tilde{s}_{s,a}) =
Q(s,a)$ for all $s \in S$ and $a \in A(s)$ according to Theorem 65. The $\tilde{U}$-values are
admissible iff, for all $\tilde{s} = \tilde{s}_{s,a} \in \tilde{S}$, $-\widetilde{gd}(\tilde{s}_{s,a}) \le \tilde{U}(\tilde{s}_{s,a}) \le 0$, which is equivalent to, for
all $\tilde{s}_{s,a} \in \tilde{G}$, $\tilde{U}(\tilde{s}_{s,a}) = 0$, and, for all $\tilde{s}_{s,a} \in \tilde{S} \setminus \tilde{G}$, $-\widetilde{gd}(\tilde{s}_{s,a}) \le \tilde{U}(\tilde{s}_{s,a}) \le 0$.

$\tilde{U}(\tilde{s}_{s,a}) = 0$ for all $\tilde{s}_{s,a} \in \tilde{G}$ is equivalent to $Q(s,a) = \tilde{U}(\tilde{s}_{s,a}) = 0$ for all $s \in G$.

$-\widetilde{gd}(\tilde{s}_{s,a}) \le \tilde{U}(\tilde{s}_{s,a}) \le 0$ for all $\tilde{s}_{s,a} \in \tilde{S} \setminus \tilde{G}$ is equivalent to $-1 - gd(succ(s,a)) =
-\widetilde{gd}(\tilde{s}_{s,a}) \le \tilde{U}(\tilde{s}_{s,a}) = Q(s,a) \le 0$ for all $s \in S \setminus G$ according to Theorem 63.

Put together, it follows that the $\tilde{U}$-values of the transformed domain are admissible iff
the Q-values of the original domain are admissible.